Saturday, February 9, 2013

Generators Used as Counters

I've a bit late to the Python generator party.  My main problem in adoption was my lack of understanding of where I could apply the generator function concept.  I always wanted to apply this tool to counters in a counting based problem and I recently had an opportunity to do so.

Confession:  every time I do a counter based problem, I end up with off by one errors, so much so that I have to go through at least three iterations to get it right.

Generator functions used in part as counters are what got me through this.  The script was a mess and the use of the yield statement inside a couple functions cleaned it up for me to the point where it only took one more iteration to get it right (I had spent a weekend and about six iterations prior to this).

The main utility of the yield statement for me has three parts:

    1) it "remembers" where it is and where it stopped within the generator between next() calls.

    2) it "remembers" the values within itself (in this case a counter) between next() calls.

    3) because of 1) and 2) it's harder (for me, at least) to corrupt values and state than it is by passing around a counter variable and applying += operator calls.

The (somewhat convoluted) scenario:

I'm working at a mine in Africa.  The boss comes to me with a spreadsheet of about 500 items from various data sources and requests a database and a report that looks just like it and updates any time it's called.  Difficult, but not impossible.  I start on a schema for a database.  Considering I'm a geologist, it's not all bad:


 









Coworker comes by not long after and informs us that he is not keen on the idea of a custom database and that the problem should be done using SharePoint lists as lookup tables.  MS SQL 2008 r2 SSRS (SQL Server Reporting Services) is what we and our mine software vendor use for reporting.

I start to work on the SharePoint lists and hit a snag.  SSRS in its current state can only operate on a single dataset when you're presenting information in a table.  My main dataset in this case is the list of performance indicators or assays my boss asked for in the first place.  That list is in a SharePoint list.  Further, SSRS can only take one SharePoint list as a data source.  In other words, to the best of my knowledge, you can not write a relational query across SharePoint lists in SSRS 2008 r2.  My solution was to lookup values within my table in SSRS based on names or keys.  It got ugly and string concatenation was involved, but, amazingly, it worked.

The only real problem, apart from lousy performance, was the matching of shipment data from the mine based on mine levels as required in the spreadsheet.  What I needed to do was create a separate record in the main KPI list for each KPI-pit-material(roughly economic character of the rock)-bench-subbench combination.  There was a last column dealing with mining method, but for the time being I could get that from the subbench number.

Anyway, 4 or 5 data sources requiring elevations, 4 (currently, there will be more) open pits, 5 KPI's, 9 or 10 materials, about 40 benches, 9 subbenches - it ends up being a lot of records (5 * 4 * 5 * 10 * 40 * 9 ~= 360,000).  In the end I had about 350,000 records total.  I actually started doing this by hand in an Excel spreadsheet.  I got as far as replicating everything for the four pits and knew I would have to do the rest programmatically.  Knowing what I know now, I would have done the whole thing programmatically.  Live and learn.

Shown below is the code (sorry about the plain look - I'm not with it with Syntax Highlighter and other tools yet).  There are some naive constructs and hacks, but it works.  Thanks for having a look.

"""
Deal with main list of KPI's in
Jeff's reconciliation report.
Deal with elevation, flitch, and mining method here.
Assign key values (integers) to
relevant columns.
"""

import csv
import collections as colx
import os
import sys
import pprint

# ranges that require bench/flitch addition
import benchranges as br

# bench range file format:
#
# """
# Row ranges in <boss's> Excel spreadsheet
# that require repetition for bench and flitch.
# """
#
# RANGES = ((36, 42),
#           (42, 48),
#           (48, 54),
#           (54, 60),
#           (60, 66),
#           (66, 72),
#           (72, 78),
#           (78, 84),
#              .
#              .
#              .
#             etc.

DIRX = r'C:\MSPROJECTS\SiteTripJAN13\JeffDatabase\FirstAttemptWDatabase\LookupDataSetup'
DIRCSV = r'C:\MSPROJECTS\SiteTripJAN13\JeffDatabase\FirstAttemptWDatabase\LookupDataSetup\loadtosharepoint\spreadsheetsforload\cvsdumps'
CSV = '.csv'

KPIFIELDS = ['KPIID', 'KPIOrder', 'AttributeDescription', 'Units', 'DataType', 'Source', 'ValueCalc', 'Period', 'Location', 'Pit', 'Level', 'MiningMethod']

NTKPIFIELDS = ','.join(KPIFIELDS)

# account for extra level, flitch, and BHVermeer files
CSVFILES = KPIFIELDS[2:] + ['Benches', 'BHVermeer', 'Flitches']

MAINFILE = 'kpiids'

BIGNUMBER = 1000000

# OK, the only file that needs a named tuple is the KPI one.
# The others all have two columns where we're trying to match
# basic format of KPI lookup csv files (faux unit example):
#
# 1,Frobnostication unit
# 2,Foo unit
# 3,ounces
# 4,gallons per minute
#     .
#     .
#     .
#    etc.

# format of main input file:
#
# 1,1,Plant Harvest Salable Copper,Tonne Metal,Plant Actual,Mill Report,Value,Shift Day,Plant,Not Applicable,Not Applicable,Not Applicable
# 2,2,Plant Contained Salable Cobalt,Tonne Metal,Plant Actual,Mill Report,Value,Shift Day,Plant,Not Applicable,Not Applicable,Not Applicable
# 3,3,Mill Tonnage,Tonne Ore,Plant Actual,Mill Report,Value,Hour,Plant,Not Applicable,Not Applicable,Not Applicable
#     .
#     .
#     .
#    etc.

# intended format of output file:
#
# 1,1,120,10,4,2,2,3,3,3,2,42,10,2,4
# 2,2,119,10,4,2,2,3,3,3,2,42,10,2,4
# 3,3,108,11,4,2,2,1,3,3,2,42,10,2,4
#     .
#     .
#     .
#    etc.

# have to add slots for pit, bench, flitch, mining method
# 14 slots all integers
# XXX - very much hard coded
# 'KPIID', 'KPIOrder', 'AttributeDescription', 'Units'
# 'DataType', 'Source', 'ValueCalc', 'Period'
# 'Location', 'Pit', 'Level', Bench, Flitch'
# 'MiningMethod', BHVermeer

MAINFILELINEFMT = (11 * '{{{}}},').format(*[x for x in KPIFIELDS[:11]])
MAINFILELINEFMT += '{Bench},{Flitch},{MiningMethod},{BHVermeer}'

KPIID = 'KPIID'
KPIORDER = 'KPIOrder'
PIT = 'Pit'
BENCH = 'Bench'
FLITCH = 'Flitch'
BHVERMEER = 'BHVermeer'
NA = 'Not Applicable'

# specific for each item (1FEB2013)
NABENCH = 42
NAFLITCH = 10
NABHVERMEER = 4
NAPIT = 3
EXCEL = 'excel'

def makelookupdict():
    """
    Read the available csv files for the SharePoint
    lookup tables and make a corresponding dictionary
    of integer values corresponding to each value
    (string) key.

    Returns dictionary.
    """
    os.chdir(DIRCSV)
    # make lookup table of values
    lookupdict = {}
    for filex in CSVFILES:
        f = open(filex + CSV, 'r')
        lookupdict[filex] = {}
        readerx = csv.reader(f, dialect = EXCEL)
        for linex in readerx:
            keyx = linex[1]
            lookupdict[filex][keyx] = linex[0]
        f.close()
    return lookupdict

def newnum():
    """
    Generator function that will continuously
    increment an integer and yield the new
    value.
    """
    x = 1
    while 1:
        yield x
        x += 1

def initializerecord(recorddict, recordnt, lookupdict):
    """
    Put corresponding integer lookup values
    for fields in dictionary.

    recorddict is an empty dictionary.

    recordnt is a named tuple of record values.

    lookupdict is a dictionary of lookup indices
    for the values in question.
    """
    for namex in KPIFIELDS[2:]:
        # XXX - dictionary lookup is a bit hackish
        recorddict[namex] = lookupdict[namex][recordnt.__dict__[namex]]

def dummyfieldsforkpi(recorddict, n):
    """
    Fills in Not Applicable for fields in KPI
    that doesn't require them (bench, flitch, etc.)

    recorddict is the dictionary for the KPI
    in question.

    n is the KPI identifier and order number
    of original KPI file.
    """
    recorddict[BENCH] = NABENCH
    recorddict[FLITCH] = NAFLITCH
    recorddict[BHVERMEER] = NABHVERMEER
    recorddict[KPIID] = n
    recorddict[KPIORDER] = n

def dealwithbenchloop(n, counter, recorddicts, f2):
    """
    Generator function.

    Fill in rows for benches and flitches for a given
    set of grouped KPI's.

    n is the current line number of the file being
    written to.

    counter is the generator for incrementing line
    numbers is the file being written to.

    recorddicts is the base list of grouped record
    dictionaries.

    f2 is the file being written to.

    Returns the position the line number counter
    for the new file.
    """
    # loop through benches
    for benchx in range(1, 42):
        # loop through flitches, increment counter
        # 1-9 corresponds to flitch indices in original flitch list (1-9)
        for flitchx in range(1, 10):
            for recn in recorddicts:
                # newlineno will represent KPIID, KPIOrder, add to dict
                recn[KPIID] = n
                recn[KPIORDER] = n
                # match bench, flitch, bhvermeer with id's and add to dict
                recn[BENCH] = benchx
                recn[FLITCH] = flitchx
                if flitchx == 9:
                    recn[BHVERMEER] = 1
                else:
                    recn[BHVERMEER] = 2
                # write dictionary to file
                print(MAINFILELINEFMT.format(**recn), file = f2)
                n = next(counter)
                yield n

def cyclethroughkpis():
    """
    Generator function that will cycle through
    a list of KPI (key performance indicator)
    records and write each one to a new file
    adding records for benches and flitches
    where appropriate.
    """
    # get main file and assign lookup values
    os.chdir(DIRX)
    f = open(MAINFILE + CSV, 'r')
    f2 = open('kpilookups.csv', 'w')
    # named tuple for cycling through file
    recordx = colx.namedtuple('recordx', KPIFIELDS)
    # counter for new file
    x = newnum()
    # map object of records
    # want line number and named tuple records in an iterable
    ziplines = zip(range(1, BIGNUMBER), map(recordx._make, csv.reader(f, dialect = EXCEL)))
    # lookup dictionary
    lookupdict = makelookupdict()
    # multiple bench original file line index ranges
    benchidxs = (range(*z) for z in br.RANGES)
    # increment new file counter
    n = next(x)
    while 1:
        recorddict = {}
        try:
            zipnt = next(ziplines)
        except StopIteration:
            f.close()
            f2.close()
        initializerecord(recorddict, zipnt[1], lookupdict)
        # if this requires multiple repeated records for bench and flitch
        if int(recorddict[PIT]) != NAPIT:
            # list of grouped record dictionaries
            recorddicts = [recorddict]
            # get range of rows for grouped KPI's
            rowrange = next(benchidxs)
            # first item in rowrange already current
            for i in rowrange[1:]:
                zipnt = next(ziplines)
                recorddict = {}
                initializerecord(recorddict, zipnt[1], lookupdict)
                recorddicts.append(recorddict)
            r = dealwithbenchloop(n, x, recorddicts, f2)
            # track new file counter
            for w in r:
                yield w
            # track counter
            n = w
            yield n, zipnt[0], rowrange
        else:
            # make record complete
            dummyfieldsforkpi(recorddict, n)
            # write to file
            print(MAINFILELINEFMT.format(**recorddict), file = f2)
            # yield current file counter, original file counter, and row range
            yield n, zipnt[0], None
            n = next(x)

if __name__ == '__main__':
    dorecords = cyclethroughkpis()
    while 1:
        try:
            print(next(dorecords))
        except StopIteration:
            print('Finished run.')
            # XXX - a bit hackish
            sys.exit()
f.close()
f2.close()

5 comments:

  1. @Bjørnar Yes and no. The incrementing is the same, but the idea of the object maintaining its state between calls, if I understand things correctly, is a generator or yield characteristic.

    There are two main yield calls in cyclethroughkpis(), one for the count in the file being written to, and one for the line number of the original file. This is the utility of the yield statement to me; it allows me to track a couple things at once without having to worry about it once it's set up. This is probably not the simplest approach to the problem, but it works for me.

    Thanks for having a look. CBT

    ReplyDelete
  2. @Bjørnar I thought about this over breakfast and I now know what you were saying. The newnum() function is entirely unnecessary, and, if I'm not mistaken, so is range(1, BIGNUMBER) used elsewhere in the script. Next time I'll cut down on code and use itertools.count(1) like you suggested.

    I mentioned in the narrative that I probably had used some naive constructs. This is an example of one. This is a learning experience for me and I really appreciate you taking the time. CBT

    ReplyDelete
  3. Thanks for the post, I will look forward to see more posts from your blog and Adpower Generators.

    ReplyDelete