Python: pyodbc with SQL Server

This post is about connecting to SQL Server using pyodbc.

Install package

pip install pyodbc

If you are running in Databricks, the current driver will be “{ODBC Driver 17 for SQL Server}”.

If you are running in Synapse, the current driver will be “{ODBC Driver 18 for SQL Server}”.

Check pyodbc Version and Available ODBC Drivers

import pyodbc

print(pyodbc.version)    # version of the pyodbc package
print(pyodbc.drivers())  # ODBC drivers visible to pyodbc

Check Which ODBC Drivers Are Installed in Databricks

%sh
cat /etc/odbcinst.ini

Install ODBC Driver 17 in Databricks

%sh
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get -y install msodbcsql17
apt-get -y install unixodbc-dev

Connect using SQL Auth

I do not recommend SQL Auth.

import pyodbc

secret = "<GET SECRET SECURELY>"

connection = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'Server=tcp:<SERVER_NAME>;PORT=<PORT>;Database=<DATABASE>;'
    'Uid=<USER>;Pwd=' + secret + ';'
    'Encrypt=yes;TrustServerCertificate=no;Connection Timeout=<TIMEOUT>;'
)

Connect Using Domain Auth

import pyodbc

secret = "<GET SECRET SECURELY>"

connection = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'Server=tcp:<SERVER_NAME>;PORT=<PORT>;Database=<DATABASE>;'
    'Uid=<USER>;Pwd=' + secret + ';'
    'Encrypt=yes;TrustServerCertificate=no;Connection Timeout=<TIMEOUT>;'
    'Authentication=ActiveDirectoryPassword;'
)

Connect using Azure SPN

pip install msal

import struct
import msal
import pyodbc

global_token_cache = msal.TokenCache()
secret = "<GET SECRET SECURELY>"

global_spn_app = msal.ConfidentialClientApplication(
    '<CLIENT_ID>', authority='https://login.microsoftonline.com/<TENANT_ID>',
    client_credential=secret,
    token_cache=global_token_cache,
)

result = global_spn_app.acquire_token_for_client(scopes=['https://database.windows.net//.default'])

# The driver expects the access token as a length-prefixed UTF-16-LE byte string
# passed through the SQL_COPT_SS_ACCESS_TOKEN connection attribute.
SQL_COPT_SS_ACCESS_TOKEN = 1256

token = bytes(result['access_token'], 'utf-8')
exptoken = b""

for i in token:
    exptoken += bytes([i])
    exptoken += bytes(1)

token_struct = struct.pack("=i", len(exptoken)) + exptoken

# Do not pass Uid/Pwd when authenticating with an access token.
connection = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'Server=tcp:<SERVER_NAME>;PORT=<PORT>;Database=<DATABASE>;'
    'Encrypt=yes;TrustServerCertificate=no;Connection Timeout=<TIMEOUT>;',
    attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct}
)

Once you have the connection you can set up the cursor.

cursor = connection.cursor()

Then execute a command

command = "<COMMAND>"
params = ()
cursor.execute(command, params)
connection.commit()
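
If the command returns rows (for example a SELECT), you can fetch them from the cursor. A minimal sketch; the query is a placeholder:

cursor.execute("SELECT * FROM <SCHEMA>.<TABLE>")

# fetchall() returns a list of pyodbc Row objects
for row in cursor.fetchall():
    print(row)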

After you are finished, close the cursor and connection.

cursor.close()
connection.close()

 

Databricks: Notebook SQL

This post shows how to work with Databricks SQL through a notebook.

Create a Temp View of a DataFrame.

df = <SOMETHING>
df.createOrReplaceTempView("<TABLE_NAME>")
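
Once the view exists you can also query it from Python. A minimal sketch; the view name is whatever you passed to createOrReplaceTempView:

result_df = spark.sql("SELECT * FROM <TABLE_NAME>")
display(result_df)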

Drop a Table

%sql
drop table <SCHEMA>.<TABLE>;

Describe Table

%sql
desc table extended <SCHEMA>.<TABLE>;

Describe Detail

%sql
describe detail <SCHEMA>.<TABLE>;

Show Table Properties

%sql
SHOW TBLPROPERTIES <SCHEMA>.<TABLE>;

Describe History

%sql
describe history <SCHEMA>.<TABLE>;

Create Schema

%sql
CREATE SCHEMA IF NOT EXISTS <SCHEMA>;

Create Parquet Table

%sql
CREATE TABLE <SCHEMA>.<TABLE> USING PARQUET LOCATION 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/';

Create Delta Table

%sql
CREATE TABLE <SCHEMA>.<TABLE> USING DELTA LOCATION 'abfss://<CONTAINER>@<STORAGE_ACCOUNT>.dfs.core.windows.net/<FOLDER>/';

Upsert

%sql
MERGE INTO <SCHEMA>.<TABLE> AS t
USING (
  SELECT column, column2
  FROM <SOURCE_TABLE>
) AS source
ON (source.column = t.column)
WHEN MATCHED THEN
  UPDATE SET
    t.column = source.column
WHEN NOT MATCHED THEN
  INSERT (
    column, column2
  )
  VALUES (
    source.column, source.column2
  );
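
If you prefer to run the upsert from Python, the Delta Lake Python API exposes the same merge. A minimal sketch, assuming the target is a Delta table and source_df is a hypothetical DataFrame holding the rows to merge in:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "<SCHEMA>.<TABLE>")

(target.alias("t")
    .merge(source_df.alias("source"), "source.column = t.column")
    .whenMatchedUpdate(set={"column": "source.column"})
    .whenNotMatchedInsert(values={"column": "source.column",
                                  "column2": "source.column2"})
    .execute())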

 

Hive: Struct

This tutorial will show you how to use structs. If you have not installed Hive yet, please follow this tutorial.

Create Table with Struct:

CREATE TABLE test_struct (
    columnA STRING,
    columnB VARCHAR(15),
    columnC INT,
    columnD TIMESTAMP,
    columnE DATE,
    columnF STRUCT<key:STRING, value:INT>
)
STORED AS ORC;

Insert Data:

INSERT INTO test_struct 
SELECT '1', '2', 1, '2019-02-07 20:58:27', '2019-02-07', NAMED_STRUCT('key', 'val', 'value', 1);

Select Data:
This will return the “key” and “value” fields of columnF.

SELECT columnF.key, columnF.value
FROM test_struct;

Hive: Map

This tutorial will show you how to use maps. If you have not installed Hive yet, please follow this tutorial.

Create Table with Map:

CREATE TABLE test_map (
    columnA STRING,
    columnB VARCHAR(15),
    columnC INT,
    columnD TIMESTAMP,
    columnE DATE,
    columnF MAP<STRING, INT>
)
STORED AS ORC;

Insert Data:

INSERT INTO test_map 
SELECT '1', '2', 1, '2019-02-07 20:58:27', '2019-02-07', MAP('Val', 1);

Select Data:
This will return the value stored under the key “Val” in columnF.

SELECT columnF['Val']
FROM test_map;

Hive: Misc

This tutorial will show you some miscellaneous usage for working with Hive. If you have not installed Hive yet, please follow this tutorial.

NVL:
Checks if a value is NULL and, if so, substitutes another value.

SELECT NVL(columnA, 'was null')
FROM test;

CAST:
If columnE is a date and you want it as a string.

SELECT CAST(columnE AS STRING)
FROM test;

CONCAT:
This will concatenate the strings together, giving “Test the code!”

SELECT CONCAT('Test', ' the ', 'code!');

CONCAT_WS:
This will concatenate the strings starting at “Test”, using the first argument as the separator, giving “Test the code”.

SELECT CONCAT_WS(' ', 'Test', 'the', 'code');

MIN:
This will give the minimum value of columnC

SELECT MIN(columnC) AS Min_ColumnC
FROM test;

MAX:
This will give the maximum value of columnC

SELECT MAX(columnC) AS Max_ColumnC
FROM test;

DISTINCT:
This will select the distinct combinations of columnA and columnB.

SELECT DISTINCT columnA, columnB
FROM test;

CASE:

SELECT CASE WHEN columnA = 'val' THEN 'Is Val' ELSE 'Not Val' END
FROM test;

COLLECT_SET:
This will collect the distinct values of columnC into an array and select the first element.

SELECT COLLECT_SET(columnC)[0]
FROM test;

CURRENT_DATE:
This will give you the current date in the format yyyy-MM-dd.

SELECT CURRENT_DATE();

UNIX_TIMESTAMP:
This will give you the current Unix timestamp in seconds since the epoch, e.g. 1549591492.

SELECT UNIX_TIMESTAMP();

FROM_UNIXTIME:
This will take a Unix timestamp and display it in the format yyyy-MM-dd HH:mm:ss.

SELECT FROM_UNIXTIME(1549591492);

TO_DATE:
This will convert a timestamp or date string to a date in the format yyyy-MM-dd.

SELECT TO_DATE('2019-02-01 01:01:01');

YEAR:

SELECT YEAR(columnE)
FROM test;

MONTH:

SELECT MONTH(columnE)
FROM test;

DAY:

SELECT DAY(columnE)
FROM test;

DATE_ADD:
This will add 4 days to the date, giving 2019-01-05.

SELECT DATE_ADD('2019-01-01', 4);

UPPER:
This will upper case columnA.

SELECT UPPER(columnA)
FROM test;

LOWER:
This will lower case columnA.

SELECT LOWER(columnA)
FROM test;

TRIM:
This will trim leading and trailing spaces in columnA.

SELECT TRIM(columnA)
FROM test;

LITERALS:
If a column name contains a space, you will need to use a backtick literal with the AS keyword or when you are defining it in the CREATE TABLE command. Although I don’t recommend this, it is possible.

SELECT columnA AS `test column`
FROM test;

Hive: Tables

This tutorial will show you some common usage for working with tables. If you have not installed Hive yet, please follow this tutorial.

Show Tables:

SHOW TABLES;
SHOW TABLES LIKE '*test*';

Table Creation:

CREATE TABLE test (
columnA STRING,
columnB VARCHAR(15),
columnC INT,
columnD TIMESTAMP,
columnE DATE
)
STORED AS ORC;

Table Creation with Partitioning:

CREATE TABLE test_partition (
columnA STRING,
columnB VARCHAR(15),
columnC INT,
columnD TIMESTAMP,
columnE DATE
)
PARTITIONED BY (columnF INT)
STORED AS ORC;

Inline Table Creation:

CREATE TABLE test_inline STORED AS ORC AS
SELECT *
FROM test;

Temporary Table Creation:

CREATE TEMPORARY TABLE temp (
columnA STRING,
columnB VARCHAR(15),
columnC INT,
columnD TIMESTAMP,
columnE DATE
)
STORED AS ORC;

DESC Table:
This will show you the basic definition of a table.

DESC test;

DESC EXTENDED Table:
This will show you the extended definition of a table.

DESC EXTENDED test;

Drop Table:

DROP TABLE IF EXISTS temp;

NiFi: ExecuteSQL Processor

In this tutorial I will guide you through adding a processor to NiFi for querying a SQL table.

For this tutorial you will need an AVRO schema called “dttest”, and its contents are as follows.

{
     "type": "record",
     "namespace": "com.example",
     "name": "FullName",
     "fields": [
       { "name": "id", "type": "int" },
       { "name": "name", "type": "string" }
     ]
}

First we need to drag the processor onto the grid.

Next we need to select the ExecuteSQLRecord processor.

Next we must configure the processor.

Now we must create the JsonRecordWriter service.

Now we name the JsonRecordWriter.

Configure the JsonWriter.

Next we create the DB Connection Service.

Next we name the DB Connection Service.

Configure the DB Service.

Now validate that all the settings are as below.

Now you are all done. It will now query your table.

Postgres: Vacuum

You should have autovacuum on. If it is turned off, it will require manual vacuuming and analyzing to be performed.

You can run vacuuming manually with:

VACUUM (VERBOSE, ANALYZE);

You can turn autovacuum on with the settings below. You will need to modify the “postgresql.conf” file for this. On Windows it is located at “C:\Program Files\PostgreSQL\9.4\data\postgresql.conf” and on Ubuntu it is at “/etc/postgresql/9.4/main/postgresql.conf”. In the off chance that it isn’t in one of those locations, run the below first to find it.

find / -type f -name "postgresql.conf"

Locate the line “# AUTOVACUUM PARAMETERS” and apply the following settings. Note that turning autovacuum on means the “VACUUM ANALYZE” command will also be run automatically to keep statistics up to date.

autovacuum = on                        # remove the leading '#'
autovacuum_analyze_threshold = 100
autovacuum_vacuum_threshold = 100
track_counts = on

Now you will need to restart the PostgreSQL service.

Ubuntu:

/etc/init.d/postgresql restart
invoke-rc.d postgresql restart

Windows:

Restart the “postgresql-x64-9.4” service from services.msc, or from an administrative command prompt:

net stop postgresql-x64-9.4
net start postgresql-x64-9.4

Formatting and Readability

This is the guest editorial I put on SQL Server Central.

By Oliver Gaudreault, 2010/06/18

Today we have a guest editorial from Oliver Gaudreault.

Too often I have seen stored procedures, functions and general SQL scripts that are indecipherable. What does this mean? It means that you will be forced to spend time making the code readable if you are required to debug or investigate it. Many times I have run across someone who edited an IF statement thinking it was inside a block when it wasn’t; the code in the IF statement applied to everything, and the result was not what the original business logic intended. Proper structure and form in the code would remedy this problem.

Using the following would help to make the code more readable:

Ensuring all flow control statements start and terminate appropriately.
Indentation is properly structured.
Flow control statements (such as IF, WHILE, etc.) that wrap a single inline statement don’t strictly need a BEGIN and END, but in my experience I believe it is necessary to add them anyway.

Not only does this methodology apply to SQL, but I believe it also applies to any modern programming language. Some may think that making your code readable might render you replaceable. I beg to differ! I want to see programmers who are versatile: programmers who can quickly and easily find and repair an issue with little wasted effort. Lack of time should not equal bad practices. Good practices should equal more time.

This is, of course, debatable, and I would like to know others’ thoughts on this.