NiFi: ExecuteSQL Processor

In this tutorial I will guide you through how to add a processor for querying a SQL table to NiFi.

For this tutorial you will need an AVRO schema called “dttest”; its contents are as follows.

{
     "type": "record",
     "namespace": "com.example",
     "name": "FullName",
     "fields": [
       { "name": "id", "type": "int" },
       { "name": "name", "type": "string" }
     ]
}

First we need to drag the processor onto the grid.

Next we need to select the ExecuteSQLRecord processor.

Next we must configure the processor.

Now we must create the JsonRecordWriter service.

Now we name the JsonRecordWriter.

Configure the JsonWriter.

Next we create the DB Connection Service.

Next we name the DB Connection Service.

Configure the DB Service.

Now validate that all the settings are correct.

Now you are all done. The processor will now query your table.

Django: React Website

In this tutorial I will demonstrate how to create a Django + React website using Django 2.0. You must have Eclipse installed before you continue; if you already have it installed and configured you can continue on. We will also require Postgres 9.4 and nodejs. You can get Nodejs from here. You can get Postgres 9.4 from here.

Pip Django Install:
pip install django
pip install django-webpack-loader
Django Version:

If you are not sure what version you are running do the following

python -c "import django; print(django.get_version())"
Eclipse Create Project:


Eclipse Setup Project:


Eclipse Django DB Settings:


Eclipse Django Setup Successful:

Once you click “Finish” your project will be created with the structure below.


Folder Structure:
  • Under djangoApp project.
  • folder: static
  • folder: djangoApp
    • folder: templates
      • file: index.html
      • folder: base
        • file: base.html
  • folder: assets
    • folder: bundles
    • folder: js
      • file: index.jsx
Node:

Inside the djangoApp application do the following

npm init
npm install --save-dev jquery react react-dom webpack webpack-bundle-tracker babel-loader babel-core babel-preset-es2015 babel-preset-react
npm install create-react-class --save
webpack.config.js:
var path = require('path')
var webpack = require('webpack')
var BundleTracker = require('webpack-bundle-tracker')

module.exports = {
    //the base directory (absolute path) for resolving the entry option
    context: __dirname,
    //the entry point we created earlier. Note that './' means 
    //your current directory.
    entry: {
		"index": [path.resolve(__dirname, "./assets/js/index.jsx")],
	},
	output: {
		path: path.resolve('./assets/bundles/'),
		filename: "[name]-[hash].js",
	},
    plugins: [
        //tells webpack where to store data about your bundles.
        new BundleTracker({filename: './webpack-stats.json'}), 
        //makes jQuery available in every module
        new webpack.ProvidePlugin({ 
            $: 'jquery',
            jQuery: 'jquery',
            'window.jQuery': 'jquery' 
        })
    ],
    module: {
        loaders: [
		{
			test: /\.jsx?$/,
			exclude: /(node_modules)/,
			loader: 'babel-loader',
			query: {
				presets: ['react','es2015']
			}
		}
        ]
    }
}
djangoApp\settings.py:

Installed Apps

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'webpack_loader',
]

Add/Edit the following template directive

TEMPLATES = [
 {
    'BACKEND': 'django.template.backends.django.DjangoTemplates',
    'DIRS': [os.path.join(BASE_DIR, 'djangoApp', 'templates'),],
    'APP_DIRS': True,
    'OPTIONS': {
        'context_processors': [
            'django.template.context_processors.debug',
            'django.template.context_processors.request',
            'django.contrib.auth.context_processors.auth',
            'django.contrib.messages.context_processors.messages',
        ],
    },
},]

Add the following static directive

STATIC_URL = '/static/'

STATICFILES_DIRS = [
    os.path.join(BASE_DIR, 'assets'),
]

Modify DATABASES

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'YOUR_DB_NAME',
        'USER': 'YOUR_USER',
        'PASSWORD': 'YOUR_PASSWORD',
        'HOST': 'localhost',
        'PORT': 5432
    }
}

Webpack Loader

WEBPACK_LOADER = {
    'DEFAULT': {
        'BUNDLE_DIR_NAME': 'bundles/',
        'STATS_FILE': os.path.join(BASE_DIR, 'webpack-stats.json'),
    }
}
djangoApp\views.py:

We will create our index page view. Notice the third argument, a dict; those are the variables passed to the template to make our site dynamic.

from django.shortcuts import render

def index(request):
    return render(request, 'index.html', {'title': 'Index Page', 'script_name': 'index'})
djangoApp\urls.py:

Add the following imports

from django.conf.urls import url
from django.urls import path
from django.contrib import admin
#This is the index view we created above
from djangoApp.views import index

urlpatterns = [
    url(r'^$', index, name='index'),
    path('admin/', admin.site.urls),
]
djangoApp\templates\base\base.html:

Let’s setup our base template and setup our blocks that the other templates will inherit from.

<html>
	<head>
		<title>{% block title %}{% endblock %}</title>
	</head>
	<body>
		{% block content %}
		{% endblock %}
	</body>
</html>
djangoApp\templates\index.html:

The important part here is the extends tag; without it your base.html template won’t be inherited. The {% with %} tag and the title variable make our template dynamic and allow us to incorporate React into our site.

{% extends "base/base.html"  %}
{% load render_bundle from webpack_loader %}
{% load staticfiles %}
{% block title %}
	{{title}}
{% endblock %}
{% block content %}
	<div id="container"></div>
	{% with script=script_name %}
		{% render_bundle script 'js' %}
	{% endwith %} 
{% endblock %}
assets\js\index.jsx:

This is our React class.

var React = require('react');
var ReactDOM = require('react-dom');
var createReactClass = require('create-react-class');

var App = createReactClass({
    render: function() {
        return (
            <h1>
            React App Page
            </h1>
        )
    }
});

ReactDOM.render(<App />, document.getElementById('container'));
Database Setup/Migration:

For this tutorial we used Postgres. At this point please make sure you have created the djangoApp db and the user you specified in the settings.py file. Then run the following commands in order.

#Migrates the auth
python manage.py migrate auth
#migrates the rest
python manage.py migrate
#Create the user for accessing the django admin ui
#This will ask you for user names and passwords. Don't make it the same as in your settings.py file.
python manage.py createsuperuser
Start Server:
webpack -p
python manage.py runserver

Your site is now running at http://localhost:8000.

Your admin site is now running at http://localhost:8000/admin/.

 

References:

I used this video as a guideline to get the project started. However, some parts didn’t work right, so I adjusted them and made changes such as requiring just one template.

Postgres: dblink

Sometimes we need to connect to an external database from within a query. dblink makes this easy, but before doing so make sure it meets your business requirements.

SELECT *
FROM dblink('dbname=##DBNAME## port=##PORT## host=##DBSERVER## user=##DBUSER## password=##PWD##', 'SELECT my, fields FROM ##TABLE##') AS remote_data(my integer, fields varchar);
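
Note that dblink ships as a Postgres extension; if the query above fails with “function dblink does not exist”, install it once per database first:

CREATE EXTENSION IF NOT EXISTS dblink;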

Postgres: String to Rows

In this tutorial I will show you how to split a string into rows, delimited by a character.

There are three ways to do this; all lead to the same answer.

--METHOD 1
SELECT split_data FROM regexp_split_to_table('I love programming!', E' ') AS split_data;

--METHOD 2
SELECT split_data 
FROM unnest(string_to_array('I love programming!',' ')) AS split_data;

--METHOD 3
SELECT split_string.arr[i] as split_data
FROM (
	SELECT generate_series(1, array_upper(arr, 1)) AS i, arr
	FROM (SELECT ARRAY['I','love','programming!'] arr) t
) split_string;
Results:

Each executes in roughly 11ms based on my analysis. I would prefer “regexp_split_to_table”.
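
If you also need each element’s position, Postgres 9.4 lets you combine the set-returning form with WITH ORDINALITY; a small sketch:

SELECT word, idx
FROM regexp_split_to_table('I love programming!', E' ') WITH ORDINALITY AS t(word, idx);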

Flask: React Website

This tutorial describes in depth how to create a React website with Python. You must have Eclipse installed before you continue; if you already have it installed and configured you can continue on. Note that you should probably have the HTML Editor and TypeScript IDE installed for Eclipse.

FYI: I am using Windows at the moment for this tutorial but you can substitute Ubuntu in if you so choose.

We will require Postgres 9.4, nodejs before you continue. You can get Nodejs from here. You can get Postgres 9.4 from here.

In this tutorial we use Flask. If you are not familiar with it go here. Flask is a lightweight Python web framework based on Werkzeug and Jinja 2.

Folder Structure:
  • You will need to create a folder called “testApp”.
  • Inside testApp create the following:
    • folder: app
      • file: models.py
      • file: __init__.py
      • folder: auth
        • file: __init__.py
        • file: views.py
      • folder: home
        • file: __init__.py
        • file: views.py
        • folder: js
          • file: home.jsx
      • folder: templates
        • file: base.html
        • file: login.html
        • file: register.html
    • file: config.py
    • file: requirements.txt
    • file: webpack.config.js
    • file: run.py
    • folder: instance
      • file: config.py
    • folder: static
      • folder: common
        • folder: css
          • file: base.css
      • file: manifest.json
    • .babelrc
Install Python Packages:

Open cmd/terminal and navigate into your testApp folder and run the following commands.

pip install flask-sqlalchemy && pip freeze > requirements.txt
pip install flask-login && pip freeze > requirements.txt
pip install flask-migrate && pip freeze > requirements.txt
pip install psycopg2 && pip freeze > requirements.txt
pip install flask-Webpack && pip freeze > requirements.txt
pip install Flask-WTF && pip freeze > requirements.txt
pip install flask-bootstrap && pip freeze > requirements.txt

Couple things to note:

  • The “&& pip freeze > requirements.txt” saves the install in requirements.txt.
  • flask-migrate: database migrations package
  • flask-sqlalchemy: model engine for your database.
  • flask-login: provides user session management for Flask
  • psycopg2: Postgres driver
  • flask-Webpack: A Flask extension to manage assets with Webpack
  • flask-WTF: flask form validation
  • flask-bootstrap: An extension that includes Bootstrap in your project, without any boilerplate code
Routing:

For the purpose of this example we are using the basic Flask routing implementation, i.e. @home.route('/'). However, if you want to do more advanced routing, look into Resource-based routing.

Database:

Open PgAdmin III and create yourself a database called “testApp”. Also create a user with a password, granting access to the testApp database. Set permissions as you see fit. Don’t forget to write down your password :).

Setup Config.py:

Read here for SQLAlchemy configuration options. DEBUG is for Flask debugging.

class Config(object):
    """
    This is the shared common configurations
    """

    # Put any configurations here that are common across all environments

class DevelopmentConfig(Config):
    """
    This is the development configurations
    """
    
    DEBUG = True
    SQLALCHEMY_ECHO = True #Display queries to console

class ProductionConfig(Config):
    """
    This is the production configurations
    """
    DEBUG = False
    SQLALCHEMY_ECHO = False #Do not Display queries to console

app_config = {
    'development': DevelopmentConfig,
    'production': ProductionConfig
}

Things to Note:

Notice how we don’t have any database connection settings in this config.py. That is because we really shouldn’t check our database connection strings, etc. into source control.

Setup instance config.py:

We open the file config.py inside our “instance” folder and add this line only.

SQLALCHEMY_DATABASE_URI = "postgresql://##USER##:##PASSWORD##@##HOST##/testApp"
Setup __init__.py in “app” folder:
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from flask_login import LoginManager
from flask_migrate import Migrate
from flask_bootstrap import Bootstrap
from flask_webpack import Webpack
import os

#Importing from the config.py
from config import app_config

# db variable initialization
db = SQLAlchemy()

#The flask login manager
login_manager = LoginManager()

webpack = Webpack()

def create_app(config_name):
    #This will be either "development" or "production" mapped to what we write in the config.py application
    #static_folder is where the static folder will be located
    app = Flask(__name__, instance_relative_config=True, static_folder=os.path.join(os.getcwd(), "static"))
    print('Running in %s' % (config_name))
    app.config.from_object(app_config[config_name])
    app.config.from_pyfile('config.py')
    #You need a secret key to be able to utilise the database connection
    app.secret_key = 'Awesome App'
    Bootstrap(app)
    db.init_app(app)
    #This will make it so our chunked js files are able to be loaded on the template
    app.config.update({'WEBPACK_MANIFEST_PATH': '../manifest.json'})
    webpack.init_app(app)

    #if a user tries to access a page that they are not authorized to, it will redirect to the specified view and display the specified message.
    login_manager.init_app(app)
    login_manager.login_message = "You must be logged in to access this page."
    #auth.login is not the route but actually the class path.
    login_manager.login_view = "auth.login"
    
    #This lets us do our migrations
    migrate = Migrate(app, db)

    #Bring in our new tables
    from app import models

    #Our blueprints for our app

    #This is how you get authenticated
    from .auth import auth as auth_blueprint
    app.register_blueprint(auth_blueprint)
	
    #Bring in the home module
    from .home import home as home_blueprint
    app.register_blueprint(home_blueprint)

    return app
Environment Variable Setup:

Flask has two environment variables that you can set which point to the environment to use and the run.py file: FLASK_CONFIG and FLASK_APP. I set FLASK_CONFIG to “development” and FLASK_APP to “run.py” (on Linux, for example, export FLASK_CONFIG=development and export FLASK_APP=run.py). Notice that “development” maps to the key/value pair in config.py.

Setup run.py:

Notice how we utilise FLASK_CONFIG from the environment variables to set up our environment and grab the right config class.

import os #We need this to get the OS ENV VARIABLE 'FLASK_CONFIG'

#You are going to import the create_app from the __init__.py file
from app import create_app

#In our environment variables we create "FLASK_CONFIG" and set our value either development or production
config_name = os.getenv('FLASK_CONFIG')
app = create_app(config_name)

if __name__ == '__main__':
    app.run()

Now the fun really starts!!!

Setup models.py:

We set up our User model, ensuring the password is stored securely. Later on I will show you what happens with flask-migrate.

from flask_login import UserMixin
from werkzeug.security import generate_password_hash, check_password_hash

from app import db

class User(UserMixin, db.Model):
    """
    Create a Users table
    """

    # Ensures table will be named in plural and not in singular
    # as is the name of the model
    __tablename__ = 'users'

    id = db.Column(db.Integer, primary_key=True)
    email = db.Column(db.String(60), index=True, unique=True)
    username = db.Column(db.String(60), index=True, unique=True)
    first_name = db.Column(db.String(60), index=True)
    last_name = db.Column(db.String(60), index=True)
    password_hash = db.Column(db.String(128))

    @property
    def password(self):
        """
        Prevent password from being accessed
        """
        raise AttributeError('password is not a readable attribute.')

    @password.setter
    def password(self, password):
        """
        Set password to a hashed password
        """
        self.password_hash = generate_password_hash(password)

    def verify_password(self, password):
        """
        Check if hashed password matches actual password
        """
        return check_password_hash(self.password_hash, password)

    def __repr__(self):
        return '<User: {}>'.format(self.username)
Generate User Table:

Using flask-migrate we can now auto-generate our User table in the Postgres testApp database.

  1. Run “flask db init” to create our migration repository
    1. A “migrations” folder is created automatically.
  2. Run “flask db migrate”
    1. This generates the migration script
  3. Run “flask db upgrade”
    1. This applies the missing migrations to the database, i.e. creates the users table.
    2. You will notice another table also got created “alembic_version”. This is how it stores the version it is at.
  4. Confirm that the db was migrated successfully.
Create Our Templates:

We use the base.html file for our React pages, using the “app” id on the div. The login.html is for our login form and register.html is to register a new user. You can use the base.css file in the static/common/css folder to style it how you want. I recommend using flex.

base.html:

<!DOCTYPE html>
{% import "bootstrap/utils.html" as utils %}
{% import "bootstrap/wtf.html" as wtf %}
{% extends "bootstrap/base.html" %}
<html>
	<head>
		{% block head %}
		{{ super() }}
		<title>{{title}}</title>
		<link rel="stylesheet" type="text/css" href={{ url_for("static", filename="common/css/base.css") }} />
		{% endblock %}
	</head>
	{% block content %}
	<div class="container-app">
		<div class="container-header">Team Link</div>
	    <div class="container-body" id="app"></div>
	    <script type="text/javascript" src={{ asset_url_for(script_name) }}></script>
	</div>
	{% endblock %}
</html>

login.html:

<!DOCTYPE html>
{% import "bootstrap/utils.html" as utils %}
{% import "bootstrap/wtf.html" as wtf %}
{% extends "bootstrap/base.html" %}
<html>
	<head>
		{% block head %}
		{{ super() }}
		<title>{{title}}</title>
		<link rel="stylesheet" type="text/css" href={{ url_for("static", filename="common/css/base.css") }} />
		{% endblock %}
	</head>
	{% block content %}
	<div class="container-app">
		<div class="container-header">My Awesome App</div>
	    <div class="container-body" id="app">
			<div class="panel panel-default">
				<div class="panel-heading">
					<h3 class="panel-title">Login</h3>
				</div>
				<div class="panel-body">
					{{ wtf.quick_form(form) }}
				</div>	    
				{{ utils.flashed_messages() }}
				Click here to <a href="/register">register</a>
			</div>
		</div>
	</div>
	{% endblock %}
</html>

register.html:

<!DOCTYPE html>
{% import "bootstrap/utils.html" as utils %}
{% import "bootstrap/wtf.html" as wtf %}
{% extends "bootstrap/base.html" %}
<html>
	<head>
		{% block head %}
		{{ super() }}
		<title>{{title}}</title>
		<link rel="stylesheet" type="text/css" href={{ url_for("static", filename="common/css/base.css") }} />
		{% endblock %}
	</head>
	{% block content %}
	<div class="container-app">
		<div class="container-header">Team Link</div>
	    <div class="container-body" id="app">
			<div class="panel panel-default">
				<div class="panel-heading">
					<h3 class="panel-title">Register</h3>
				</div>
				<div class="panel-body">
    				{{ wtf.quick_form(form) }}
				</div>	    
				{{ utils.flashed_messages() }}
				Click here to <a href="login">login</a>
			</div>
		</div>
	</div>
	{% endblock %}
</html>
Setup home __init__.py:

This creates the blueprint that we have in app.__init__.py.

from flask import Blueprint

home = Blueprint('home', __name__)

#This is the views.py from the home directory.
from . import views
Setup home views.py:

@login_required lets Flask know that you need to be logged in to reach this page. Don’t forget to look at the “render_template” call: it passes “script_name” and uses the base.html template. “script_name” is what base.html utilises to bring in our js file for us on each page we go to.

from flask import render_template
from flask_login import login_required

#This is our blueprint we setup in __init__.py
from . import home

@home.route('/')
@login_required
def homepage():
    """
    Render the homepage template on the / route
    """
    return render_template('base.html', script_name='home.js', title="Welcome")
Setup React home JSX file:

React uses jsx files, so in my home directory I have a js folder and inside that our home.jsx file. Let’s set that up as something really basic. Remember above I said that in “render_template” we use the div id “app”; ReactDOM will put our class in that spot. I will show you later how that is done.

var React = require("react");
var ReactDOM = require("react-dom");
var createReactClass = require("create-react-class");

var Home = createReactClass({
	render: function() {
		return (<div>Hi</div>);
	}
});

ReactDOM.render(<Home />, document.getElementById("app"));
Node Init:

We need to go to our root testApp directory from a command prompt and run the following in this exact order.

npm init
npm install react react-dom --save
npm install webpack webpack-dev-server --save
npm install --save-dev babel babel-core babel-loader babel-preset-es2015 babel-preset-react
npm install create-react-class --save
npm install bootstrap --save
npm install jquery --save
npm install clean-webpack-plugin --save-dev
npm install manifest-revision-webpack-plugin --save-dev
npm install sync-exec --save-dev
npm install uglifyjs-webpack-plugin --save-dev

Things to Note:

  • npm init: Creates package.json file
  • npm install --save: will save the install to package.json
  • Each package install went into a new directory called “node_modules”.
Setup webpack.config.js:

We want to transpile our jsx files to .js. Remember back in app.__init__.py we set up the static folder location. Check out the “path” key below; that is how webpack knows where its assets are going to be located. Our entry key/value pairs are the locations of each of our jsx files to create assets from. Then we have our loaders.

var path = require("path");
var webpack = require('webpack');
var ManifestRevisionPlugin = require('manifest-revision-webpack-plugin');
const CleanWebpackPlugin = require('clean-webpack-plugin');
const UglifyJsPlugin = require('uglifyjs-webpack-plugin');

module.exports = {
	entry: {
		"home": [path.resolve(__dirname, "./app/home/js/home.jsx")],
	},
	output: {
        path: path.join(__dirname, "static"),
		publicPath: "/static/",
		filename: "[name]-[hash].js"
	},
	plugins: [
                new CleanWebpackPlugin(["static/*.js", "static/manifest.json"], {root: __dirname, verbose: true, dry: false, exclude: ["base.css"]}),
		new ManifestRevisionPlugin(path.resolve(__dirname, "./manifest.json"), {rootAssetPath: './static', ignorePaths: ['./node_modules']}),
		new webpack.NoEmitOnErrorsPlugin(),
		new UglifyJsPlugin(),
		new webpack.optimize.AggressiveMergingPlugin(),
		new webpack.HotModuleReplacementPlugin()
	],
	module: {
		loaders: [
			{
				test: /\.jsx?$/,
				exclude: /(node_modules)/,
				loader: 'babel-loader',
				query: {
					presets: ['react','es2015']
				}
			}
		]
	}
};
.babelrc

Set the following in your file at the root directory.

{
 "presets": [ "es2015", "react" ]
}
Let’s Test Node & Webpack:

Open a command prompt, navigate to our testApp folder and run “webpack”. You will notice that the “static” folder in our root directory is populated. In it we will now see a “home-[hash].js” file; the manifest maps it back to the logical name “home.js”. Remember back above we set the __init__.py static folder, and in the home.views.py file we set render_template script_name to “home.js”. This is how it all maps together.

Setup auth __init__.py:

This creates the blueprint that we have in app.__init__.py.

from flask import Blueprint

auth = Blueprint('auth', __name__)

#This is the views.py from the auth directory.
from . import views

Setup auth views.py:

from flask import flash, redirect, render_template, url_for
from flask_login import login_required, login_user, logout_user
from flask_wtf import FlaskForm
from wtforms import PasswordField, StringField, SubmitField, ValidationError
from wtforms.validators import DataRequired, Email, EqualTo
from .. import db, login_manager
from ..models import User

from . import auth

class RegistrationForm(FlaskForm):
    """
    Form for users to create new account
    """
    email = StringField('Email', validators=[DataRequired(), Email()])
    username = StringField('Username', validators=[DataRequired()])
    first_name = StringField('First Name', validators=[DataRequired()])
    last_name = StringField('Last Name', validators=[DataRequired()])
    password = PasswordField('Password', validators=[
                                        DataRequired(),
                                        EqualTo('confirm_password')
                                        ])
    confirm_password = PasswordField('Confirm Password')
    submit = SubmitField('Register')

    def validate_email(self, field):
        if User.query.filter_by(email=field.data).first():
            raise ValidationError('Email is already in use.')

    def validate_username(self, field):
        if User.query.filter_by(username=field.data).first():
            raise ValidationError('Username is already in use.')

class LoginForm(FlaskForm):
    """
    Form for users to login
    """
    email = StringField('Email', validators=[DataRequired(), Email()])
    password = PasswordField('Password', validators=[DataRequired()])
    submit = SubmitField('Login')

@login_manager.user_loader
def load_user(id):
    #This is the how we locate the user in our testApp database
    return User.query.get(int(id))

@auth.route('/register', methods=['GET', 'POST'])
def register():
    form = RegistrationForm()
    if form.validate_on_submit():
        user = User(email=form.email.data,
                            username=form.username.data,
                            first_name=form.first_name.data,
                            last_name=form.last_name.data,
                            password=form.password.data)

        # add user to the database
        db.session.add(user)
        db.session.commit()
        flash('You have successfully registered! You may now login.')

        # redirect to the login page
        return redirect(url_for('auth.login'))

    # load registration template
    return render_template('register.html', form=form, title='Register')

@auth.route('/login', methods=['GET', 'POST'])
def login():
    form = LoginForm()
    if form.validate_on_submit():
        # check whether user exists in the database and whether
        # the password entered matches the password in the database
        user = User.query.filter_by(email=form.email.data).first()
        if user is not None and user.verify_password(
                form.password.data):
            # log user in
            login_user(user)

            # redirect to the dashboard page after login
            return redirect("/")

        # when login details are incorrect
        else:
            flash('Invalid email or password.')

    # load login template
    return render_template('login.html', form=form, title='Login')

@auth.route('/logout')
@login_required
def logout():
    """
    Handle requests to the /logout route
    Log an user out through the logout link
    """
    logout_user()
    flash('You have successfully been logged out.')

    # redirect to the login page
    return redirect(url_for('auth.login'))
Let’s Run our App:

Open a command prompt, navigate to our testApp folder and run “flask run”. If no mistakes were made you should now be able to navigate to the site at http://localhost:5000.

 

Resources:

In learning how to incorporate Flask I used part one of the following tutorial as a guide. Very helpful.

Postgres: Vacuum

You should have autovacuum turned on. If it is turned off, manual vacuuming and analyzing will be required.

You can run vacuuming by:

VACUUM (VERBOSE, ANALYZE)
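
To verify when tables were last vacuumed, manually or by autovacuum, you can query the statistics collector:

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables;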

You can turn autovacuum on using the settings below. You will need to modify the “postgresql.conf” file for this. On Windows it is located at “C:\Program Files\PostgreSQL\9.4\data\postgresql.conf” and on Ubuntu at “/etc/postgresql/9.4/main/postgresql.conf”. In the off chance that it isn’t there, run the below first to find it.

find / -type f -name "postgresql.conf"

Locate the line of text “# AUTOVACUUM PARAMETERS” and apply the following. You should note that turning autovacuum on means it will also run the “VACUUM ANALYZE” command automatically to update statistics.

autovacuum = on (remove the #)
autovacuum_analyze_threshold = 100
autovacuum_vacuum_threshold = 100
track_counts = on

Now you will need to restart postgresql service.

Ubuntu:

/etc/init.d/postgresql restart
invoke-rc.d postgresql restart

Windows:

Restart the “postgresql-x64-9.4” service from services.msc, or from an elevated command prompt:

net stop "postgresql-x64-9.4"
net start "postgresql-x64-9.4"

Java: Connect to Postgres

Below are the steps to set up a connection to a Postgres DB and some other options that you can use.

Pom.xml:

<dependency>
	<groupId>org.postgresql</groupId>
	<artifactId>postgresql</artifactId>
	<version>9.4.1208.jre7</version>
</dependency>

Import:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

Build The Connection:

Class.forName("org.postgresql.Driver");
Connection connection = DriverManager.getConnection(##URL##, ##USER##, ##PASS##);

Preparing the Query:
We utilise “PreparedStatement” to set up the query, which allows us to use parameters in the query.

PreparedStatement ps = connection.prepareStatement("select id from table where column=?");

If your query has parameters (i.e. ?) then you will need to pass in a value for each parameter. Notice the ##POSITION## and ##VALUE## placeholders below; position is the 1-based location where the parameter appears in the query. You can set various types of data, for example integer, json, etc.

ps.setString(##POSITION##, ##VALUE##);

After we perform a query we need to retrieve the data; we use “ResultSet” for that. How we read it depends on whether the query returns one row or many. Below are examples of a single row returned vs multiple rows returned.

ResultSet rs = ps.executeQuery();

if (rs.next()) {
	int variable = rs.getInt("id");
}

while (rs.next()) {
	//Do Something
}

Insert:
If you want to perform an insert you don’t need “executeQuery”; you can just call “execute”.

ps = connection.prepareStatement("insert into mytable (column) values (?)");
ps.setInt(##POSITION##, ##VALUE##);
ps.execute();

Batching:
Sometimes we have a lot of updates or inserts to perform. We should batch that.

//You setup the preparedStatement and then add to the batch.
ps.addBatch();

//You can set a max batch size to send the batch once it hits that amount
if (++count % batchSize == 0) {
	ps.executeBatch();
}

//Don't forget to execute any remaining statements once the loop is done
ps.executeBatch();

Passing Json as a Parameter:
We create the postgres object (org.postgresql.util.PGobject), set its type to json and its value as a string in json format. Next we set the preparedStatement parameter to the postgres object that we just created.

final PGobject jsonObject = new PGobject();
jsonObject.setType("json");
jsonObject.setValue(value.toString());

//Now set the json object into the preparedStatement
ps.setObject(##POSITION##, jsonObject);

Postgres: Restore

Restoring a postgres instance is rather straightforward. If you have a better way please feel free to let me know.

pg_restore -c --if-exists -Fd -h ##SERVERNAME## -p ##SERVERPORT## -U ##USER## -d ##DB_NAME## -v ##FILENAME_W_PATH## > ##OUTPUT_LOG## 2>&1
  1. -Fc: the dump to restore is in the custom compressed format.
  2. -Fd: the dump to restore is in the directory format.
  3. --if-exists: used with -c.

Postgres: Backup

Backing up a postgres instance is rather straightforward. You can automate the backup or back up manually; I have documented both below. If you have a better way please feel free to let me know.

Manual:

pg_dump -h ##SERVERNAME## -p ##SERVERPORT## -U ##USER## -j 8 -c --if-exists -Fd -b -v -f ##FILENAME_W_PATH## ##DB_NAME## > ##OUTPUT_LOG## 2>&1

There are a variety of options you can choose from but the above is a standard I like to use.

  1. -Fc: This compresses the data in the custom format and allows for restoring via pg_restore.
  2. -Fd: This compresses the data to a directory structure for restoring via pg_restore.
  3. -j 8: This runs 8 concurrent jobs.
  4. -N mySchema: This excludes the schema mySchema from being backed up.
  5. --if-exists: Used with -c.

Automated:

You can also use PgAgent to do scheduled backups and I recommend this however I don’t have the steps yet. That will hopefully come shortly.

You should create a backup user for the automated version. Here is what the user should have.

CREATE USER ##BACKUP_USER## WITH ENCRYPTED PASSWORD '##PASSWORD##';
GRANT CONNECT ON DATABASE ##DATABASE NAME## TO ##BACKUP_USER##;
GRANT USAGE ON SCHEMA public TO ##BACKUP_USER##;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO ##BACKUP_USER##;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO ##BACKUP_USER##;

Whichever OS you are using there is a different way to set it up.

Windows:
  1. You need to create “pgpass.conf” file in “%APPDATA%\postgresql” directory.
  2. pgpass should contain connection information such as “hostname:port:database:username:password”
  3. You can then create a batch file to run pg_dump. Example of such a batch file below. You could also write a PS script which is the preferred direction to go.
 @echo off
   for /f "tokens=1-4 delims=/ " %%i in ("%date%") do (
     set day=%%i
     set month=%%j
     set year=%%k
   )

   for /f "tokens=1-4 delims=: " %%i in ("%time%") do (
     set hour=%%i
     set minute=%%j
     set second=%%k
   )

   set backupDir=\\NETWORK_SHARE\%month%_%day%_%year%
   set backupFile=DBName_%month%_%day%_%year%_%hour%_%minute%_%second%.backup

   if not exist %backupDir% (mkdir %backupDir%)

   C:\"Program Files"\PostgreSQL\9.4\bin\pg_dump.exe -h ##SERVERNAME## -p ##SERVERPORT## -U ##BACKUP_USER## -j 8 -c --if-exists -Fd -b -v -f %backupDir%\%backupFile% ##DB_NAME## > ##OUTPUT_LOG## 2>&1
Ubuntu:

You need to create the .pgpass file in the user’s home directory. It should contain connection information in the form hostname:port:database:username:password. You will also need to modify the permissions to be owner read/write only.

touch .pgpass
chmod 600 .pgpass

Next we need to set up the crontab job by entering “crontab -e”. You can call the backup script below on whatever schedule you want (for example, a hypothetical entry “0 2 * * * /path/to/backup.sh” runs it nightly at 2 AM).

#! /bin/sh
currentDate=$(date +"%m_%d_%Y")
currentDateTime=$currentDate$(date +"_%H_%M_%S")

if [ ! -d ##NETWORK_SHARE##/##DB_NAME## ]
then
        mkdir ##NETWORK_SHARE##/##DB_NAME##
fi

if [ ! -d ##NETWORK_SHARE##/##DB_NAME##/$currentDate ]
then
        echo $currentDate
        mkdir ##NETWORK_SHARE##/##DB_NAME##/$currentDate
fi

pg_dump -h ##SERVERNAME## -p ##SERVERPORT## -U ##BACKUP_USER## -j 8 -c --if-exists -Fd -b -v -f ##NETWORK_SHARE##/##DB_NAME##/$currentDate/$currentDateTime.backup ##DB_NAME## > ##OUTPUT_LOG## 2>&1

Postgres: Functions/Triggers

I usually use plpgsql to write my functions.

Example Update Function:
This function could be used for saving/creating records. It passes back a boolean and an id.

 CREATE OR REPLACE FUNCTION Public.fn_my_function(param1 integer, param2 integer, OUT Success Boolean, OUT ID integer)
AS $body$
DECLARE
      any_variable integer;
BEGIN
      $3 = True;
      $4 = 3234;

      EXCEPTION
            WHEN OTHERS THEN
                  $3 = False;
                  $4 = -1;
                  RAISE WARNING '[Public.fn_my_function] - UDF ERROR [OTHER] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
END;
$body$ LANGUAGE plpgsql
SECURITY INVOKER
SET search_path = Public;
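
Calling it is then just a select; with the OUT parameters above, this returns one row containing the success flag and id:

 SELECT * FROM Public.fn_my_function(1, 2);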

 

Get Function:
You can also return a table.
i.e.: RETURNS TABLE(id integer, data json) AS $body$

 CREATE OR REPLACE FUNCTION Public.fn_get_function()
RETURNS SETOF Public.My_Table AS $body$
BEGIN
      RETURN QUERY 
      SELECT * 
      FROM public.my_table t
      ORDER BY t.Name;
END
$body$ LANGUAGE plpgsql
SECURITY INVOKER
SET search_path = Public;

 

Comment Function:

 COMMENT ON FUNCTION Public.fn_my_function(integer, integer) IS '';

 

Drop Function:
Notice how I use “IF EXISTS”. This is best because without it, if the function didn’t exist, your script would fail.

 DROP FUNCTION IF EXISTS Public.fn_my_function(Integer, Integer, Integer);

 

Trigger:

 CREATE OR REPLACE FUNCTION Public.fn_my_function() RETURNS TRIGGER AS $BODY$
DECLARE
    v_old_data json;
    v_new_data json;
BEGIN
    IF (TG_OP = 'UPDATE') THEN
      v_old_data := row_to_json(OLD.*);
      v_new_data := row_to_json(NEW.*);
      RETURN NEW;
    ELSIF (TG_OP = 'DELETE') THEN
      v_old_data := row_to_json(OLD.*);
      RETURN OLD;
    ELSIF (TG_OP = 'INSERT') THEN
      v_new_data := row_to_json(NEW.*);
      RETURN NEW;
    ELSE
      RAISE WARNING '[Public.fn_my_function] - Other action occurred: %, at %',TG_OP,now();
      RETURN NULL;
    END IF;
       
    EXCEPTION
      WHEN data_exception THEN
            RAISE WARNING '[Public.fn_my_function] - UDF ERROR [DATA EXCEPTION] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
            RETURN NULL;
      WHEN unique_violation THEN
            RAISE WARNING '[Public.fn_my_function] - UDF ERROR [UNIQUE] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
            RETURN NULL;
      WHEN OTHERS THEN
            RAISE WARNING '[Public.fn_my_function] - UDF ERROR [OTHER] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
            RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql
SECURITY INVOKER
SET search_path = Public;
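
A trigger function does nothing until it is attached to a table. A minimal sketch, assuming a hypothetical table Public.my_table:

 CREATE TRIGGER tg_my_table_audit
AFTER INSERT OR UPDATE OR DELETE ON Public.my_table
FOR EACH ROW EXECUTE PROCEDURE Public.fn_my_function();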

Postgres: Check Data Type

When you deal with the json data type there is the option of using json_typeof(value). However this doesn’t always give you what you expect; for example a date value renders as text. To get around this I have the following two functions. Feel free to use them or suggest alternatives.

Test Value is Date:

 create or replace function is_date(s varchar) returns boolean as $$
begin
  perform s::date;
  return true;
exception when others then
  return false;
end;
$$ language plpgsql;
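
A quick sanity check of the function:

 SELECT is_date('2017-08-17'); --true
SELECT is_date('not a date'); --false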

 

Test Json Value for Type:

 CREATE OR REPLACE FUNCTION get_json_type(s json) returns text as $$
DECLARE val_type text;
BEGIN
     IF (SELECT public.is_date(s::text)) THEN
        val_type := 'date';
     ELSEIF (json_typeof(s) = 'number') THEN
        val_type := 'double precision';
     ELSEIF (json_typeof(s) = 'string') THEN
        val_type := 'text';
     ELSEIF (json_typeof(s) = 'boolean') THEN
        val_type := 'boolean::integer';
     ELSE
        val_type := json_typeof(s);
     END IF;

     return val_type;
END;
$$ language plpgsql;
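
Example calls using literal json values:

 SELECT get_json_type('123'::json);     --double precision
SELECT get_json_type('true'::json);    --boolean::integer
SELECT get_json_type('"hello"'::json); --text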

 

Postgres: Misc

Here are some misc things you can do in postgres 9.4.

Print to Console:

 raise notice 'VALUE: %', value;

Json Build Object:

 json_build_object('name', value, 'name2', value2)

Json Array:

 json_agg(data)

Extract Month:
You can also extract year, day, etc.

 extract(month from date_column)

Json Querying:
You can extract a field from a json field. You can do:

 rec->>'id'
rec#>'{field, sub_field,value}'
rec->'object'

Row To Json:

 SELECT row_to_json(t)
FROM mytable t

Update With FROM:

 UPDATE mytable l
SET old_val_id=sub.new_val_id
FROM (
     SELECT l2.id, s.id as new_val_id
     FROM mytable l2
          INNER JOIN mysupertable s ON s.id=l2.old_val_id
) sub
WHERE sub.id=l.id;

Inline Script:

 DO $do$
DECLARE id integer;
BEGIN      

END $do$ LANGUAGE plpgsql;

If:

 IF CONDITION THEN
END IF;

Let’s say you have a json field, and in that field you have a key which tells you what other key to select for certain data. It could be customizable by the user entering the data. To dynamically select that field you can do the below.

jsonField->((to_json((jsonField->'key'->'sub_key'))#>>'{}')::text)
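
A worked example with a hypothetical document, where 'key'->'sub_key' names the field to pull:

 WITH t(jsonField) AS (
      SELECT '{"key":{"sub_key":"name"},"name":"Bob"}'::json
)
SELECT jsonField->((to_json((jsonField->'key'->'sub_key'))#>>'{}')::text)
FROM t;
--returns "Bob"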

Upsert:
Sometimes we want to perform an update if the record exists or an insert if it doesn’t. Most of the time we write a function that deals with this on a record by record basis. But what if we want to do a batch update or insert? We perform an upsert. See below: we write the update and return the result, then in the remainder of the query we check if the upsert gave us anything; if it didn’t, we perform the insert.

WITH upsert AS (UPDATE MyTable mt
            SET column_name = sub.new_value
            FROM (SELECT mot.id, mot.new_value 
                 FROM MyOtherTable mot
                 ) sub
            WHERE sub.id=mt.related_id 
RETURNING *) 
INSERT INTO MyTable (id, column_name, related_id) 
SELECT ?, ?, ? 
WHERE NOT EXISTS (SELECT * FROM upsert)
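
Worth noting: on Postgres 9.5 and newer (this series targets 9.4) the same effect is available natively via INSERT ... ON CONFLICT; a sketch with hypothetical values, which requires a unique constraint on id:

 INSERT INTO MyTable (id, column_name, related_id)
VALUES (1, 'some value', 10)
ON CONFLICT (id) DO UPDATE SET column_name = EXCLUDED.column_name;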

Regex Substring:
There are different ways of using regex. This is one way.

substring(column_name from '([0-9]{4}-[0-9]{2}-[0-9]{2})')

PGSQL Loop:
This is one way to loop using pgsql.

DO $do$
DECLARE rec RECORD;
BEGIN
	FOR rec IN SELECT * FROM MyTable
	LOOP
		--We can then use rec like rec.colum_name
	END LOOP;
END $do$ LANGUAGE plpgsql;

Milliseconds to timestamp:

Note that to_timestamp() takes seconds, not milliseconds. The following returns 2017-08-17 21:26:05 (UTC):

select to_timestamp(1503005165);
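
If your epoch value is actually in milliseconds, divide by 1000.0 first:

 select to_timestamp(1503005165000 / 1000.0);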

Postgres: Tables

Below are some common functions for doing table creation and maintenance.

Table Creation:

 CREATE TABLE Public.mytable (
      id BigSerial PRIMARY KEY,
      text_column varchar NOT NULL,
      int_column Integer NOT NULL,
      date_column timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Create Schema:

 CREATE SCHEMA IF NOT EXISTS test;

Create Schema with Authorization:

CREATE SCHEMA IF NOT EXISTS test AUTHORIZATION myUser;

Drop Schema Cascade:

DROP SCHEMA IF EXISTS test CASCADE;

Comment On Table:

 COMMENT ON TABLE Public.mytable IS 'A List of data.';

Vacuum:
Vacuum has options; it is best to review them.

 vacuum (analyze,verbose);

Drop Constraint:

 ALTER TABLE mytable DROP CONSTRAINT mytable_id_pkey;

Add Constraint:

 ALTER TABLE public.mytable ADD CONSTRAINT mytable_id_pkey PRIMARY KEY (id);

Rename Constraint:

 ALTER TABLE mytable RENAME CONSTRAINT "mytable_id2_fkey" TO "mytable_id3__fkey";

Rename Table Column:

 ALTER TABLE mytable RENAME COLUMN text_column TO text_column2;

Rename Table:

 ALTER TABLE mytable RENAME TO mytable2;

Drop Table:

 DROP TABLE public.mytable;

Add Column to Table:

 ALTER TABLE Public.mytable ADD column_name boolean NOT NULL DEFAULT False;

Alter Column Data Type Json:

ALTER TABLE public.mytable ALTER COLUMN json_col TYPE json USING (json_col::json);

Rename Sequence:

 ALTER SEQUENCE mytable_id_seq RENAME TO mytable2_id_seq;

Sequence Table Owner:

 alter sequence mytable_id_seq owned by mytable.id;

Sequence Next Value:

 alter table mytable alter column mytable_id set default nextval('mytable_id_seq');

Add Foreign Key:

 alter table mytable ADD FOREIGN KEY (foreign_id) REFERENCES public.mytable2(foreign_id);

Create Index Json:

 CREATE INDEX mytable_idx ON Public.mytable((Data->'Key'->'Key'->>'value'));

Create Index:

 CREATE INDEX mytable_idx ON public.mytable(id);

Drop Index:

 DROP INDEX public.mytable_idx;

Re-Cluster Table:

 Cluster mytable using mytable_pkey;

Trigger:

 CREATE TRIGGER "tg_mytrigger" BEFORE UPDATE OF my_column OR INSERT ON public.mytable FOR EACH ROW EXECUTE PROCEDURE public.mytablestrigger();

Hive Kerberos Installation

We are going to install Hive over Hadoop and perform a basic query. Ensure you install Kerberos and Hadoop with Kerberos.

This assumes your hostname is “hadoop”

Download Hive:

wget http://apache.forsale.plus/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz
tar -xzf apache-hive-2.3.3-bin.tar.gz
sudo mv apache-hive-2.3.3-bin /usr/local/hive
sudo chown -R root:hadoopuser /usr/local/hive/

Setup .bashrc:

 sudo nano ~/.bashrc

Add the following to the end of the file.

#HIVE VARIABLES START
export HIVE_HOME=/usr/local/hive
export HIVE_CONF_DIR=/usr/local/hive/conf
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:/usr/local/hive/lib/*
#HIVE VARIABLES STOP

 source ~/.bashrc

Create warehouse on hdfs

kinit -kt /etc/security/keytabs/myuser.keytab myuser/hadoop@REALM.CA
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse

Create Kerberos Principals

cd /etc/security/keytabs
sudo kadmin.local
addprinc -randkey hive/hadoop@REALM.CA
addprinc -randkey hivemetastore/hadoop@REALM.CA
addprinc -randkey hive-spnego/hadoop@REALM.CA
xst -kt hive.service.keytab hive/hadoop@REALM.CA
xst -kt hivemetastore.service.keytab hivemetastore/hadoop@REALM.CA
xst -kt hive-spnego.service.keytab hive-spnego/hadoop@REALM.CA
q

Set Keytab Permissions/Ownership

sudo chown root:hadoopuser /etc/security/keytabs/*
sudo chmod 750 /etc/security/keytabs/*

hive-env.sh

cd $HIVE_HOME/conf
sudo cp hive-env.sh.template hive-env.sh

sudo nano /usr/local/hive/conf/hive-env.sh

#locate "HADOOP_HOME" and change to be this
export HADOOP_HOME=/usr/local/hadoop

#locate "HIVE_CONF_DIR" and change to be this
export HIVE_CONF_DIR=/usr/local/hive/conf

hive-site.xml

Check out this link for the configuration properties.

sudo cp /usr/local/hive/conf/hive-default.xml.template /usr/local/hive/conf/hive-site.xml

sudo nano /usr/local/hive/conf/hive-site.xml

#Modify the following properties

<property>
	<name>system:user.name</name>
	<value>${user.name}</value>
</property>
<property>
	<name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:postgresql://myhost:5432/metastore</value>
</property>
<property>
	<name>javax.jdo.option.ConnectionDriverName</name>
	<value>org.postgresql.Driver</value>
</property>
<property>
	<name>hive.metastore.warehouse.dir</name>
	<value>/user/hive/warehouse</value>
</property>
<property>
	<name>javax.jdo.option.ConnectionUserName</name>
	<value>hiveuser</value>
</property>
<property>
	<name>javax.jdo.option.ConnectionPassword</name>
	<value>PASSWORD</value>
</property>
<property>
	<name>hive.exec.local.scratchdir</name>
	<value>/tmp/${system:user.name}</value>
	<description>Local scratch space for Hive jobs</description>
</property>
<property>
	<name>hive.querylog.location</name>
	<value>/tmp/${system:user.name}</value>
	<description>Location of Hive run time structured log file</description>
</property>
<property>
	<name>hive.downloaded.resources.dir</name>
	<value>/tmp/${hive.session.id}_resources</value>
	<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
	<name>hive.server2.logging.operation.log.location</name>
	<value>/tmp/${system:user.name}/operation_logs</value>
	<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
<property>
	<name>hive.metastore.uris</name>
	<value>thrift://0.0.0.0:9083</value>
	<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
<property>
	<name>hive.server2.webui.host</name> 
	<value>0.0.0.0</value>
</property>
<property>
	<name>hive.server2.webui.port</name> 
	<value>10002</value>
</property>
<property>
	<name>hive.metastore.port</name>
	<value>9083</value>
</property>
<property>
	<name>hive.server2.transport.mode</name>
	<value>binary</value>
</property>
<property>
	<name>hive.server2.thrift.sasl.qop</name>
	<value>auth-int</value>
</property>
<property>
	<name>hive.server2.authentication</name>
	<value>KERBEROS</value>
	<description>authenticationtype</description>     
</property>
<property>
	<name>hive.server2.authentication.kerberos.principal</name>
	<value>hive/_HOST@REALM.CA</value>
	<description>HiveServer2 principal. If _HOST is used as the FQDN portion, it will be replaced with the actual hostname of the running instance.</description>
</property>
<property>
	<name>hive.server2.authentication.kerberos.keytab</name>
	<value>/etc/security/keytabs/hive.service.keytab</value>
	<description>Keytab file for HiveServer2 principal</description>  
</property>
<property>
	<name>hive.metastore.sasl.enabled</name>
	<value>true</value>
	<description>If true, the metastore thrift interface will be secured with SASL. Clients
	must authenticate with Kerberos.</description>
</property>
<property>
	<name>hive.metastore.kerberos.keytab.file</name>
	<value>/etc/security/keytabs/hivemetastore.service.keytab</value>
	<description>The path to the Kerberos Keytab file containing the metastore thrift 
	server's service principal.</description>
</property>
<property>
	<name>hive.metastore.kerberos.principal</name>
	<value>hivemetastore/_HOST@REALM.CA</value>
	<description>The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with the correct host name.</description>
</property>
<property>
	<name>hive.security.authorization.enabled</name>
	<value>true</value>
	<description>enable or disable the hive client authorization</description>
</property>
<property>
	<name>hive.metastore.pre.event.listeners</name>
	<value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
	<description>List of comma separated listeners for metastore events.</description>
</property>
<property>
	<name>hive.security.metastore.authorization.manager</name>
	<value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
	<description>
	Names of authorization manager classes (comma separated) to be used in the metastore
	for authorization. The user defined authorization class should implement interface
	org.apache.hadoop.hive.ql.security.authorization.HiveMetastoreAuthorizationProvider.
	All authorization manager classes have to successfully authorize the metastore API
	call for the command execution to be allowed.
</description>
</property>
<property>
	<name>hive.security.metastore.authenticator.manager</name>
	<value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
	<description>
	authenticator manager class name to be used in the metastore for authentication.
	The user defined authenticator should implement interface org.apache.hadoop.hive.ql.security.HiveAuthenticationProvider.
</description>
</property>
<property>
	<name>hive.security.metastore.authorization.auth.reads</name>
	<value>true</value>
	<description>If this is true, metastore authorizer authorizes read actions on database, table</description>
</property>
<property>     
	<name>datanucleus.autoCreateSchema</name>     
	<value>false</value>
</property>

Hadoop core-site.xml

Notice here how it’s .hive. that is used with the storage based authentication.

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

<property>
	<name>hadoop.proxyuser.hive.hosts</name>
	<value>*</value>
</property>
<property>
	<name>hadoop.proxyuser.hive.groups</name>
	<value>*</value>
</property>

Install Postgres 9.6

Follow this install for installing Postgresql 9.6.

sudo su - postgres
psql

CREATE USER hiveuser WITH PASSWORD 'PASSWORD';
CREATE DATABASE metastore;
GRANT ALL PRIVILEGES ON DATABASE metastore TO hiveuser;
\q
exit

Initiate Postgres Schema

schematool -dbType postgres -initSchema

Start Metastore & HiveServer2

nohup /usr/local/hive/bin/hive --service metastore --hiveconf hive.log.file=hivemetastore.log >/var/log/hive/hivemetastore.out 2>/var/log/hive/hivemetastoreerr.log &

nohup /usr/local/hive/bin/hiveserver2 --hiveconf hive.metastore.uris=" " --hiveconf hive.log.file=hiveserver2.log >/var/log/hive/hiveserver2.out 2> /var/log/hive/hiveserver2err.log &

Auto Start

sudo mkdir /var/log/hive/
sudo chown root:hduser /var/log/hive
sudo chmod 777 /var/log/hive

crontab -e

#Add the following
@reboot nohup /usr/local/hive/bin/hive --service metastore --hiveconf hive.log.file=hivemetastore.log >/var/log/hive/hivemetastore.out 2>/var/log/hive/hivemetastoreerr.log &
@reboot nohup /usr/local/hive/bin/hiveserver2 --hiveconf hive.metastore.uris=" " --hiveconf hive.log.file=hiveserver2.log >/var/log/hive/hiveserver2.out 2> /var/log/hive/hiveserver2err.log &

Now you can check the hive version

 hive --version

Hive Web URL

http://hadoop:10002/

Beeline

#We first need to have a ticket to access beeline using the hive kerberos user we setup earlier.
kinit -kt /etc/security/keytabs/hive.service.keytab hive/hadoop@REALM.CA

#Now we can get into beeline using that principal
beeline -u "jdbc:hive2://0.0.0.0:10000/default;principal=hive/hadoop@REALM.CA;"

#You can also just get into beeline then connect from there
beeline
beeline>!connect jdbc:hive2://0.0.0.0:10000/default;principal=hive/hadoop@REALM.CA

#Disconnect from beeline
!q
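
Once connected in beeline you can run a basic query to confirm the metastore and warehouse are working; a minimal smoke test with a hypothetical table name:

CREATE TABLE test (id INT, name STRING);
INSERT INTO test VALUES (1, 'hello');
SELECT * FROM test;
DROP TABLE test;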

References

http://www.bogotobogo.com/Hadoop/BigData_hadoop_Hive_Install_On_Ubuntu_16_04.php
https://maprdocs.mapr.com/home/Hive/Config-RemotePostgreSQLForHiveMetastore.html
https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool#HiveSchemaTool-TheHiveSchemaTool
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration

MongoDB Testing

MongoDB is a NoSQL type of database.

Installation on Windows is rather simple: just follow the prompts. Just download from here. You can follow the installation instructions from here if you so choose. It has a max document size of 16MB, but you can see all base configurations from here. Once you install MongoDB you will have to start the server: run C:\Program Files\MongoDB\Server\3.2\bin\mongod.exe. Then you can run C:\Program Files\MongoDB\Server\3.2\bin\mongo.exe to start querying the data.

MongoDB is a document db. Ad hoc querying is easy if the data you are querying is within the 16MB max size. There are no tables in MongoDB; they are called collections instead, and they are used to store BSON documents, which are basically JSON. Querying data is fast, I have found. It uses the map-reduce context for querying data. If you are not familiar with map reduce, it’s basically just JavaScript.

I will have more later as I test more and more.

I have done some testing and can give you some basic code below.

Show All Collections:

 show collections

 

See All Databases:

 show dbs

 

Drop A Collection:

 db.COLLECTIONNAME.drop()

 

Drop All Collections:

 db.getCollectionNames().forEach(function(c) { if (c.indexOf("system.") == -1) db[c].drop(); })

 

Create A Collection:

 db.createCollection("logs", {autoIndexId: true})

 

Create An Index:

 db.logs.createIndex( { KEY: VALUE } )

 

MapReduce:

 db.COLLECTION.mapReduce(function(){
      var key = WHATEVER;
      var value = WHATEVER;

      emit(key, value);
},
function(key, values) {
      var output = {
            KEY: VALUE
      };
      return output;
}, {out: {inline: 1}})

 

Find matching element:
The .pretty() call displays the json nicely formatted.

 db.COLLECTIONNAME.find({ "KEY.SUBKEY.ARRAYINDEX" : VALUE }).pretty()

 

Find records that match using greater-than and less-than-or-equal-to, and display the associated id field.

 db.COLLECTIONNAME.find({ "KEY" : {$elemMatch: { $elemMatch: { $gt: 0.0005, $lte: 0.0005862095 } } } }, { id: "_id"  }).pretty()

 

Find records using IN, returning the id field.

 db.COLLECTIONNAME.find({ "KEY" : { $in: [ 123,456 ] } }, { id: "_id" }).pretty()

 

You can loop through data and insert it into a new collection

 db.COLLECTIONNAME.find({ "KEY.SUBKEY.ARRAYINDEX" : 0 }).forEach(function(obj){ db.NEWCOLLECTION.insert(obj) })

 

CouchBase Testing

CouchBase is a NoSQL type of database.

Installation on Windows is rather simple: just follow the prompts. Just download from here. Unfortunately at this time CouchBase on Windows wants you to disable the firewall. I don’t recommend that, and due to this critical issue I do not currently recommend CouchBase on Windows until it has been fixed. Once installed it is viewable from http://localhost:8091. It has a max document size of 20MB, but you can see all base configurations from here. CouchBase is a document db with fast importing of documents. CouchBase has views, as does CouchDB, although a CouchBase view queries data differently than CouchDB’s views do, and CouchBase still has indexes. Its views have a fast rebuild, which makes querying data faster than in CouchDB. It uses the map-reduce context for creating views. If you are not familiar with map reduce, it’s basically just JavaScript.

I will have more later as I test more and more.

I have done some testing with a view and can give you some syntax for writing one.

 function(doc, meta) {
      emit(meta.id, output);
}

CouchDB Testing

CouchDB is a NoSQL type of database. CouchBase is related but entirely different.

Installation on Windows is rather simple: just follow the prompts. Just download from here. Once installed it is viewable from http://localhost:5984/_utils/. It has a max document size of 4GB, but you can see all base configurations from here. CouchDB is a document db. If you come from a traditional relational setting you might get frustrated at first, because querying documents is slow until you create a view. A view is basically kind of like an index in a relational setting; for each different way you query the data, you need to create a view. What gets a little ridiculous is that as your data grows, the view isn’t updated with the newest data until the view is called. So at first it’s slow, but each time it gets queried it gets faster. It uses the map-reduce context for creating views. If you are not familiar with map reduce, it’s basically just JavaScript.

I will have more later as I test more and more.

I have done some testing with a view and can give you some syntax for writing one.

 function(doc) {
      //Add any code you want
      emit(key, output);
}

 

Postgres: PgAgent Installation

This describes the necessary steps to install PgAgent on Ubuntu 14.04.

PgAgent:

  • sudo apt-get install pgagent
  • sudo -u postgres psql
  • CREATE EXTENSION pgagent;
    • Check it exists by running select * from pg_extension;
  • Edit the .pgpass file and add the connections you will need.
    • If you don’t know how to add the .pgpass file see adding pgpass below.
  • Execute the following command
    • pgagent hostaddr=127.0.0.1 dbname=postgres user=postgres
  • You can also follow https://www.pgadmin.org/docs/1.8/pgagent-install.html for installation.

.pgpass:

  • touch .pgpass
    • Put it in home directory.
    • It contains information on connections.
      • hostname:port:database:username:password
  • Modify permissions for owner read write only
    • chmod 600 .pgpass

Postgres: Setup

If you want to setup a database server for Postgresql 9.6 these are the basic steps that should be done when working with Ubuntu 16.04.

Installation:

  • sudo apt-get update
  • sudo apt-get upgrade
  • sudo reboot (if necessary)
  • echo "deb https://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main" | sudo tee -a /etc/apt/sources.list.d/pgdg.list
  • wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
  • sudo apt-get update
  • sudo apt-get install postgresql-9.6
  • cd /var/
  • mkdir pg_log (This is where the logs will be stored)
  • chmod o+wxr /var/pg_log/
  • cd /etc/postgresql/9.6/main
  • nano postgresql.conf:
    • CONNECTIONS AND AUTHENTICATION
      • set listen_addresses = '*'
      • set max_connections to 250 (or a reasonable number that works for your setup)
      • logging_collector = on
      • log_directory = '/var/pg_log'
      • log_file_mode = 0506
    • AUTOVACUUM PARAMETERS
      • autovacuum = on
        • Remove #
      • autovacuum_analyze_threshold = 100
      • autovacuum_vacuum_threshold = 100
      • track_counts = on
  • nano pg_hba.conf
    • host all all 0.0.0.0/0 md5
    • local all postgres trust
  • systemctl enable postgresql
  • systemctl start postgresql
  • /etc/init.d/postgresql restart
  • sudo -u postgres psql template1
  • ALTER USER postgres with encrypted password 'SET A PASSWORD';
  • \q
  • sudo -u postgres psql
  • CREATE EXTENSION adminpack;
    • To ensure it was installed correctly run “select * from pg_extension;”
  • If you want to view the postgresql service to find what directories are currently being used then run "ps auxw | grep postgres | grep -- -D"

Create DB:

  • sudo -u postgres psql
  • CREATE DATABASE ##DATABASE NAME##;
    • First run \l to check that DB doesn’t exist already.
  • CREATE ROLE ##ROLE NAME## LOGIN ENCRYPTED PASSWORD '##PASSWORD##';
  • GRANT ##REQUIRED## PRIVILEGES ON DATABASE ##DATABASE NAME## TO ##ROLE NAME##;
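
For example, with hypothetical names filled in:

CREATE DATABASE myapp;
CREATE ROLE myapp_user LOGIN ENCRYPTED PASSWORD 'mypassword';
GRANT ALL PRIVILEGES ON DATABASE myapp TO myapp_user;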

Useful PSQL Commands

  • \l = list database
  • \q = quit
  • \c DBNAME = switch to db
  • \dt = list tables
  • ALTER DATABASE name OWNER TO new_owner
  • ALTER TABLE test OWNER TO test;