Notes on using Python for working with Tableau, Solr and Lucidworks Fusion server data.

Environment

Install Python (Windows)
choco install python3

Install PyCharm
choco install pycharm-community

Install Tableau Server Client (TSC)
pip install tableauserverclient

Notes:

  • The Tableau REST API does not allow substring filtering (has operator) on anything other than Query_Jobs. In other words, you cannot return workbooks with the string "foo" in the title. The in operator is available on workbooks, datasources and views, but is only meant to work on tags. Otherwise, the equals operator takes an exact string. See reference.
  • Choice! Return ALL workbooks from the server, store the results in memory or a file, then iterate over the results in Python to return items by substring.
    The Tableau server-client is currently missing a get-view-by-id method for views, and a for loop using the get-workbook-by-id is slow.
  • Choice! Use python REST instead of the tableau-server-client for tagging.
  • Projects don't have a site_id. Need to authenticate to each site individually and find what projects the site contains.
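
Since the REST API won't do substring matching on names, the client-side approach above can be sketched as a small pure filter over whatever TSC.Pager returns. The server URL and credentials in the usage comment are placeholders.

```python
# Client-side substring filtering (sketch). The REST API's equals filter
# needs an exact name, so pull everything and match names in Python instead.

def filter_by_substring(items, needle):
    """Keep items whose .name contains `needle`, case-insensitively."""
    needle = needle.lower()
    return [item for item in items if needle in item.name.lower()]

# Usage with the Tableau Server Client (assumes a signed-in server object):
#   import tableauserverclient as TSC
#   server = TSC.Server('https://tableau.example.com')
#   with server.auth.sign_in(TSC.TableauAuth('user', 'password')):
#       matches = filter_by_substring(TSC.Pager(server.workbooks), 'foo')
```

This trades one big fetch for the missing server-side filter; with tens of thousands of assets it is still much faster than per-item get-by-id calls.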

Code Stuff

Code - Tag Tableau Workbooks and Views

Individual Tableau report developers can easily make their products more findable on a Tableau server by manually adding tags to their views, workbooks and data sources. But let's say non-techy business users want to scan 20,000 Tableau data assets and conditionally apply something like a "Confidential" tag to many assets at once without involving IT. This might call for a Python script that works with the Tableau REST API and follows the values specified in a CSV file.
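The script reads six columns per CSV row. The real file's headers aren't shown here, so the layout below is hypothetical, inferred from how the script indexes each row (column 1 = server URL, 2 = site, 3 = project, 4 = tag, 5 = action; column 0 is carried along but not otherwise used):

```python
# Hypothetical CSV layout inferred from the script's row indexing;
# the real file's headers and first column may differ.
import csv
import io

sample = """note,url,site,project,tag,action
x,https://tableau.example.com,Default,Finance,Confidential,add
x,https://tableau.example.com,Default,HR,Confidential,test
"""

dataset = []
reader = csv.reader(io.StringIO(sample))
for count, row in enumerate(reader):
    # skip the header row and any row without an action
    if count > 0 and len(row[5]) > 0:
        dataset.append(row[:6])

print(dataset)
```

The action column (add/remove/test/verify) is what lets a non-technical user drive the whole job from a spreadsheet.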

Declare imports

The Tableau Server Client is used as a fast way to get all Tableau Workbooks and Views. The REST module is used to add tags to the Tableau server by LUID. The XML parsing module is used for POSTing data to Tableau (JSON could be used, but the environment I'm working with here prefers XML). The CSV parser gets input from a CSV file.

# -------------------
# Imports
# -------------------
# Import the Tableau Server-Client module
import tableauserverclient as TSC

# Import the logging modules
import logging, logging.config

# Import the REST module
import requests

# Import the XML parsing module
import untangle

# Import the csv parser
import csv

Log activities

During development, the logger can be set to stream feedback to the IDE. For implementation, the logger can be set to log to a file. The file is useful for validating and documenting the work that's been done. The code includes a statement for filtering out some of the non-root Tableau logging feedback.

# -------------------
# Logging
# -------------------
logger = logging.getLogger()

# log to console (stream) or file
handler = logging.StreamHandler()
# handler = logging.FileHandler('app.log', mode='w')

# settings
formatter = logging.Formatter('%(asctime)s %(name)-2s %(levelname)-2s %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# filter out Tableau (non-root) logging
logging.config.dictConfig({
    'version': 1,
    'disable_existing_loggers': True,
})

Do some prep work

Locate the CSV file in the current directory and get the Tableau Sites and Projects that need to be processed. This is the precursor to getting the CSV file from a service.

# -------------------
# Data
# -------------------
# Get a list of items to process from CSV file
dataset = []
with open('tableau_security_tags_list.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    count = 0
    for row in csvReader:
        if len(row[5]) > 0:
            if count > 0:
                dataset.append([row[0], row[1], row[2], row[3], row[4], row[5]])
            count = count + 1

# logging.info('Process the following data:')
# logging.info(dataset)

# Loop through the items list and build list of unique URLs
logging.info('Process the following urls:')
listOfUrls = []
listOfUrlsTest = []
for item in dataset:
    u = item[1].strip()
    s = item[2].strip()
    id = u + '-' + s.replace(' ', '-')

    if id not in listOfUrlsTest:
        listOfUrlsTest.append(id)

        # Add url to list
        listOfUrls.append([id, u])
        logging.info(u)

# Loop through the items list and build list of unique sites
logging.info("*************************")
logging.info('Process the following sites:')
listOfSites = []
listOfSitesTest = []
for item in dataset:
    u = item[1].strip()
    s = item[2].strip()
    id = u + '-' + s.replace(' ', '-')

    if id not in listOfSitesTest:
        listOfSitesTest.append(id)

        # Add site to list
        listOfSites.append([id, u, s])
        logging.info(s)

# Loop through the items list and build list of projects
logging.info("*************************")
logging.info('Process the following projects:')
listOfProjects = []
for item in dataset:
    u = item[1].strip()
    s = item[2].strip()
    p = item[3].strip()
    t = item[4].strip()
    a = item[5].strip()
    id = u + '-' + s.replace(' ', '-') + p.replace(' ', '-')

    # Add project to list
    listOfProjects.append([id, s, p, t, a])
    logging.info(p)

Set variables, cache and test Tableau assets

This is the meat-and-potatoes work of applying the tags. Tableau requires a separate login for each site. There's a bit of trickiness around the "default" site that does not apply to named sites.
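The default-site rule can be isolated in a small helper (a sketch; the hard-coded "sjhs tableau site" exception mirrors the script below, which treats that named site as the server's default):

```python
# Sketch: derive the site_id (contentUrl) for TSC.TableauAuth from a site's
# display name. The 'sjhs tableau site' special case mirrors this script,
# which treats that named site as the server's default site.
def content_url_for(site_name):
    if site_name.lower() in ('default', 'sjhs tableau site'):
        return ''  # the default site uses an empty contentUrl
    return site_name.replace(' ', '')  # contentUrl has no spaces
```
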

# -------------------
# Process the tags
# -------------------
# Process data for each url separately
for url in listOfUrls:

    logging.info("*************************")

    # Server admin credentials
    tableauSignIn = 'AD Service Account username'
    tableauPassword = 'AD Service Account password'

    # Set the Tableau server-client url
    server = TSC.Server(url[1])

    # Set the API version
    server.use_server_version()
    api_version = server.version

    # Set the Tableau Python API sign-in authorization (needs to be server admin)
    initialTableauAuth = TSC.TableauAuth(tableauSignIn, tableauPassword)

    # Cache a list of site LUIDs
    site_cache = {}
    with server.auth.sign_in(initialTableauAuth):
        for luid in TSC.Pager(server.sites):
            site_cache[luid.id] = luid.name
    logging.info("site luid is {" + "\n".join("{!r}: {!r},".format(k, v) for k, v in site_cache.items()) + "}")

    response = requests.request("POST", url[1] + '/api/' + api_version + '/auth/signout')

    logging.info(listOfSites)

    # Get a list of sites related to the url
    for site in listOfSites:

        if site[0] == url[0]:

            # Set the site LUID used for the REST calls
            siteId = ''
            for key, value in site_cache.items():
                if value.lower() == site[2].lower():
                    siteId = key

            # Set the site name from the CSV value (not the leftover loop variable)
            siteName = ''
            if (site[2].lower() == 'default') or (site[2].lower() == 'sjhs tableau site'):
                siteName = ""
            else:
                siteName = site[2].replace(' ','')

            # Set the authorization string = TSC.TableauAuth('USERNAME', 'PASSWORD', site_id='CONTENTURL')
            # Note: CONTENTURL is "url_namespace" for the site in the Postgres Sites table
            # Note: CONTENTURL has no spaces
            # Note: CONTENTURL for the "Default" site is an empty string
            # Note: CONTENTURL "Default" site is not always named "Default"
            # Note: CONTENTURL does not handle being a variable with empty quotes, so create with quotes below
            if (site[2].lower() == 'default') or (site[2].lower() == 'sjhs tableau site'):
                tableauAuth = TSC.TableauAuth(tableauSignIn, tableauPassword, site_id="")
            else:
                tableauAuth = TSC.TableauAuth(tableauSignIn, tableauPassword, site_id=site[2].replace(' ',''))

            # Sign in to a specific site
            with server.auth.sign_in(tableauAuth):

                logging.info("")
                logging.info('Signing in to the "' + site[2] + '" site')

                # Set the Tableau REST sign-in token
                auth_url = url[1] + '/api/' + server.version + '/auth/signin'
                auth_payload = '<tsRequest><credentials name="' + tableauSignIn + '" password="' + tableauPassword + '"><site contentUrl="' + siteName + '" /></credentials></tsRequest>'
                auth_headers = {'Content-Type': 'application/xml'}
                auth_response = requests.request("POST", auth_url, headers=auth_headers, data=auth_payload)
                auth_response_obj = untangle.parse(auth_response.text)
                auth_token = auth_response_obj.tsResponse.credentials['token']

                # Set the XML REST headers with the auth token
                xml_headers = {'X-Tableau-Auth': auth_token, 'Content-Type': 'text/plain'}

                # Cache assets locally
                with server.auth.sign_in(tableauAuth):
                    logging.info('')
                    logging.info('Caching projects, workbooks, and views data in "' + site[2] + '" site')

                    # Cache all projects
                    # -------------------
                    project_counter, project_cache = 0, {}
                    for project in TSC.Pager(server.projects):
                        project_cache[project_counter] = (project.id, project.name)
                        project_counter = project_counter + 1
                    # logging.info("{" + "\n".join("{!r}: {!r},".format(k, v) for k, v in project_cache.items()) + "}")

                    # Cache all workbooks
                    # -------------------
                    workbook_counter, workbook_cache = 0, {}
                    for workbook in TSC.Pager(server.workbooks):
                        workbook_cache[workbook_counter] = (workbook.id, workbook.name, workbook.project_id, workbook.tags)
                        workbook_counter = workbook_counter + 1
                    # logging.info("{" + "\n".join("{!r}: {!r},".format(k, v) for k, v in workbook_cache.items()) + "}")

                    # Cache all views
                    # -------------------
                    view_counter, view_cache = 0, {}
                    for view in TSC.Pager(server.views):
                        view_cache[view_counter] = (view.id, view.name, view.project_id, view.workbook_id, view.tags)
                        view_counter = view_counter + 1
                    # logging.info("{" + "\n".join("{!r}: {!r},".format(k, v) for k, v in view_cache.items()) + "}")

            for project in listOfProjects:
                u = project[0]
                s = project[1]
                projectName = project[2]

                if u == url[0]+projectName.replace(' ', '-'):
                    logging.info('')

                    # Set the tag label
                    newTagLabel = project[3]

                    # Set the action to take
                    actionToTake = project[4]

                    # Build list of workbooks and views needing tags
                    # ------------------
                    for project_value in project_cache.values():
                        cachedProjectId = project_value[0]
                        cachedProjectName = project_value[1]

                        # Enable partial match
                        # if re.search(x, project.name):

                        # Enable exact match
                        if cachedProjectName == projectName:
                            logging.info('Project - ' + projectName)

                            for workbook_value in workbook_cache.values():
                                workbookId = workbook_value[0]
                                workbookName = workbook_value[1]
                                workbookProjectId = workbook_value[2]
                                workbookTags = workbook_value[3]
                                if workbookProjectId == cachedProjectId:

                                    # create the REST URL
                                    workbook_Url = url[1] + '/api/' + api_version + '/sites/' + siteId + '/workbooks/' + workbookId + '/tags'

                                    # add or delete tags
                                    tag_item('Workbook', workbookName, workbook_Url, workbookTags, newTagLabel, actionToTake)

                                    for view_value in view_cache.values():
                                        viewId = view_value[0]
                                        viewName = view_value[1]
                                        viewWorkbookId = view_value[3]
                                        viewTags = view_value[4]
                                        if viewWorkbookId == workbookId:

                                            # create the REST URL
                                            view_Url = url[1] + '/api/' + api_version + '/sites/' + siteId + '/views/' + viewId + '/tags'

                                            # add or delete tags
                                            tag_item('View', viewName, view_Url, viewTags, newTagLabel, actionToTake)


    # Log out of tableau server
    response = requests.request("POST", url[1] + '/api/' + api_version + '/auth/signout')

    # Sign out
    logger.info('')
    logger.info('Signing out of ' + url[1])

# Job complete
logger.info('')
logger.info('Job is complete')

Apply tags

Actual tagging is done with a function that composes each POST request. (In the actual script, the function is defined ahead of the loop above.) The function reads an action from the CSV file which defines whether to add a tag, remove a tag, or just report what would happen (test/verify). This is useful for collecting separate log files which record the before-and-after states of tags on the Tableau server.

# -------------------
# Define the tagging process
# -------------------
def tag_item(type, name, itmUrl, existingTags, newTagLabel, actionToTake):
    xml_payload = '<tsRequest><tags><tag label="' + newTagLabel + '"/></tags></tsRequest>'

    # Not already tagged
    if newTagLabel not in existingTags:

        # add label
        if actionToTake == 'add':
            logging.info('___ Add "' + newTagLabel + '" tag to ' + type + ' - ' + name + '"')
            response = requests.request("PUT", url=itmUrl, headers=xml_headers, data=xml_payload)
            logging.info(response.text.encode('utf8'))

        # announce that label does not exist (test/verify)
        else:
            logging.info('___Would have added "' + newTagLabel + '" tag to ' + type + ' - ' + name + '"')

    # Already tagged
    else:

        # remove label
        if actionToTake == 'remove':
            logging.info('___ Remove ' + newTagLabel + ' from "' + type + ' - ' + name + '"')
            response = requests.request("DELETE", url=itmUrl + '/' + newTagLabel, headers=xml_headers)
            logging.info(response.text.encode('utf8'))

        else:
            # announce that label already exists (test/verify)
            logging.info('___ ' + type + ' - ' + name + ' is already tagged "' + newTagLabel + '"')

Send Tableau Tags to Solr Index

Declare imports

The Tableau Server Client is used as a fast way to get all Tableau Workbooks and Views, the REST module is used to add tags to the Tableau server by LUID, and the pysolr module gets the data into the Solr index.

# -------------------
# Imports
# -------------------
# Import the Tableau Server-Client module
import tableauserverclient as TSC

# Import the logging modules
import logging, logging.config

# Import the REST module
import requests

# Make solr request for all workbooks
import pysolr

Log activities

During development, the logger can be set to stream feedback to the IDE. For implementation, the logger can be set to log to a file. The file is useful for validating and documenting the work that's been done. The code includes a statement for filtering out some of the non-root Tableau logging feedback.

# -------------------
# Logging
# -------------------
# Set the logger to stream or file
logger = logging.getLogger()

# log to console (stream) or file
handler = logging.StreamHandler()
# handler = logging.FileHandler('app.log', mode='w')

# settings
formatter = logging.Formatter('%(asctime)s %(name)-2s %(levelname)-2s %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# filter out Tableau (non-root) logging
logging.config.dictConfig({
    'version': 1,
    'disable_existing_loggers': True,
})

Set variables

The job type allows for testing ahead of implementation. The other variables prepare credentials and server addresses.

# -------------------
# Settings
# -------------------
# Job type
# jobType = 'test'
jobType = 'apply'

# Set the Tableau server URL
tableau_url = 'http://whatever-this-is' 

# Set the Tableau server-client
server = TSC.Server(tableau_url)

# Set the Tableau API version
server.use_server_version()
api_version = server.version

# Set the Tableau sign-in user (needs to be server admin)
tableauSignIn = 'AD Service Account username'
tableauPassword = 'AD Service Account password'

# Set the Tableau Python API sign-in authorization (needs to be server admin)
initialTableauAuth = TSC.TableauAuth(tableauSignIn, tableauPassword)

# Set the Fusion server Index Profile URL (we're actually using Lucidworks Fusion here)
fusion_url = 'http://fusion-index-profile'  

# Set the Fusion API headers
headers = {'content-type': 'application/json', 'Accept-Charset': 'UTF-8', 'Authorization': 'Basic abc123DEFgiaa=='}

# Set the Solr server
host = 'http://solr server address'   
querystring = '*:*'
fields = 'id,title'
qt = 'select'
filterquery_views = '_lw_data_source_s:("Tableau-Azure-Views")'
filterquery_workbooks = '_lw_data_source_s:("Tableau-Azure-Workbooks")'
rows = '100000'

Cache data

Get all Tableau workbook and view IDs already in Solr.

# -------------------
#  Get all items from Solr Index
# -------------------
# Solr Core
logging.info('Caching Solr Data')
solr = pysolr.Solr(host, search_handler=qt, always_commit=True)
results_workbooks = solr.search(querystring, **{
    'fl': fields,
    'fq': filterquery_workbooks,
    'rows': rows,
    # 'hl': 'true',
    # 'hl.fragsize': 10,
})
results_views = solr.search(querystring, **{
    'fl': fields,
    'fq': filterquery_views,
    'rows': rows,
    # 'hl': 'true',
    # 'hl.fragsize': 10,
})

solr_cache_workbooks = {}
for row_workbook in results_workbooks:
    solr_cache_workbooks[row_workbook['id']] = (row_workbook['id'])
logging.info('Returned {0} workbooks from Solr'.format(len(results_workbooks)))
logging.info('Cached {0} workbooks from Solr'.format(len(solr_cache_workbooks)))

solr_cache_views = {}
for row_view in results_views:
    solr_cache_views[row_view['id']] = (row_view['id'])
logging.info('Returned {0} views from Solr'.format(len(results_views)))
logging.info('Cached {0} views from Solr'.format(len(solr_cache_views)))
logging.info('')

Cache more data

Cache all workbook and view assets from Tableau server, then send only tagged assets to Solr if the asset already exists in the Solr index.

# -------------------
#  Identify the sites
# -------------------
# Cache a list of site LUIDs
site_cache = {}
with server.auth.sign_in(initialTableauAuth):
    for luid in TSC.Pager(server.sites):
        site_cache[luid.id] = luid.name
    # logging.info("site luid is {" + "\n".join("{!r}: {!r},".format(k, v) for k, v in site_cache.items()) + "}")

# -------------------
#  Process the tags
# -------------------
# Set the site LUID used for the REST calls, e.g. '87cfedbd-70c5-4cf6-9cbf-a63e80cce7a2'
for key, value in site_cache.items():

    # Set the authorization string = TSC.TableauAuth('USERNAME', 'PASSWORD', site_id='CONTENTURL')
    # Note: CONTENTURL is "url_namespace" for the site in the Postgres Sites table
    # Note: CONTENTURL has no spaces
    # Note: CONTENTURL for the "Default" site is an empty string
    # Note: CONTENTURL "Default" site is not always named "Default"
    # Note: CONTENTURL does not handle being a variable with empty quotes, so create with quotes below
    if (value.lower() == 'default') or (value.lower() == 'sjhs tableau site'):
        tableauAuth = TSC.TableauAuth(tableauSignIn, tableauPassword, site_id="")
    else:
        tableauAuth = TSC.TableauAuth(tableauSignIn, tableauPassword, site_id=value.replace(' ', ''))

    # Sign in to site
    with server.auth.sign_in(tableauAuth):

        logging.info('Caching Tableau')

        # Cache all workbooks
        # -------------------
        workbook_cache = {}
        for workbook in TSC.Pager(server.workbooks):
            workbook_cache[workbook.id] = (workbook.name, workbook.tags)
        # logging.info("{" + "\n".join("{!r}: {!r},".format(k, v) for k, v in workbook_cache.items()) + "}")
        logging.info('Cached {0} workbooks from Tableau'.format(len(workbook_cache)))

        # Cache all views
        # -------------------
        view_cache = {}
        for view in TSC.Pager(server.views):
            view_cache[view.id] = (view.name, view.tags)
        # logging.info("{" + "\n".join("{!r}: {!r},".format(k, v) for k, v in view_cache.items()) + "}")
        logging.info('Cached {0} views from Tableau'.format(len(view_cache)))

        # Build workbook datasets
        # -------------------
        logging.info('')
        logging.info('Build workbooks dataset')
        workbook_set = {x: workbook_cache[x] for x in workbook_cache if x in solr_cache_workbooks}
        logging.info('Built dataset with {0} workbooks'.format(len(workbook_set)))

        # Build view datasets
        # -------------------
        logging.info('')
        logging.info('Build views dataset')
        view_set = {x: view_cache[x] for x in view_cache if x in solr_cache_views}
        logging.info('Built dataset with {0} views'.format(len(view_set)))

        # POST all workbook tags to Fusion
        # -------------------
        logging.info('')
        logging.info('Sending all workbook tags to Fusion')

        for w_key, w_value in workbook_set.items():
            id = w_key
            name = w_value[0]
            tag = ', '.join(w_value[1])
            type = 'Workbook'

            if len(tag) > 1:
                data = '{"id": "' + id + '", "tag":"' + tag + '"}'
                runjob(name, tag, type, data)

        # POST view tags to Fusion
        # -------------------
        logging.info('')
        logging.info('Sending all view tags to Fusion')
        for v_key,v_value in view_set.items():
            id = v_key
            name = v_value[0]
            tag = ', '.join(v_value[1])
            type = 'View'

            if len(tag) > 1:
                data = '{"id": "' + id + '", "tag":"' + tag + '"}'
                runjob(name, tag, type, data)

# Sign out
response = requests.request("POST", tableau_url + '/api/' + api_version + '/auth/signout')
logger.info('All done and signed out')

Add the tags to Solr

Use a function to test or apply each POST to a Fusion Index Profile. (In the actual script, this function is defined ahead of the loop above.)

# -------------------
#  Rinse/Repeat Function
# -------------------
# Run job of each type
def runjob(name, tag, type, data):

    if jobType == 'test':
        # Run as TEST
        logging.info('Would have added tag "' + tag + '" to ' + type + ' "' + name + '"')

    else:
        # Run for REAL LIKE
        response = requests.post(fusion_url, headers=headers, data=data)
        logging.info(response.text.encode('utf8'))
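
The payload in the loop above is built by string concatenation, which breaks if a name or tag contains a quote. json.dumps is a safer way to build the same body; this sketch keeps the same field names as the hand-built payload:

```python
# Safer payload construction with json.dumps (sketch); the 'id' and 'tag'
# field names match the hand-built payload used in the loop above.
import json

def build_payload(asset_id, tags):
    """tags is a list of tag strings, joined the same way the script does."""
    return json.dumps({'id': asset_id, 'tag': ', '.join(tags)})

# build_payload('abc-123', ['Confidential', 'Finance'])
# → '{"id": "abc-123", "tag": "Confidential, Finance"}'
```

json.dumps escapes quotes and non-ASCII characters automatically, so a workbook named `John's "Q4" Report` can't corrupt the request body.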

References