Monday, October 20, 2014

subprocess.Popen() or Abusing a Home-grown Windows Executable

Each month I redo 3D block model interpolations for a series of open pits at a distant mine.  Those of you who follow my twitter feed often see me tweet, "The 3D geologic block model interpolation chuggeth . . ."  What's going on is that I've got all the processing power maxed out dealing with millions of model blocks and thousands of data points.  The machine heats up and with the fan sounds like a DC-9 warming up before flight.

All that said, running everything roughly in parallel is more efficient time-wise than running it sequentially.  An hour of chugging is better than four.  The way I've been doing this is with the Popen class from the Python (2.7) subprocess module, running my five interpolated values in parallel.  Our Python programmer Lori originally wrote this to run in sequence for a different set of problems.  I bastardized it for my own.

The subprocess part of the code is relatively straightforward.  Function startprocess() in my code covers that.
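For readers who haven't driven an interactive program through a pipe before, here is a minimal sketch of the pattern.  It is not the real thing:  the vendor executable is replaced by a tiny Python child process (CHILD, a made-up stand-in) that reads one line and echoes it, and the driver path is invented.  The sketch also waits for the child to exit on its own, which the real executable does not do.

```python
import subprocess
import sys
import tempfile

# Stand-in for the interactive vendor executable (hypothetical):
# it reads one line from stdin and echoes it to stdout.
CHILD = 'import sys; sys.stdout.write("got " + sys.stdin.readline())'


def run_interactive(answer, outpath):
    """Start the child, answer its one prompt, wait, return the exit code."""
    with open(outpath, 'w') as out:
        proc = subprocess.Popen([sys.executable, '-c', CHILD],
                                stdin=subprocess.PIPE, stdout=out,
                                universal_newlines=True)
        # The trailing newline plays the part of the Enter key.
        proc.stdin.write(answer + '\n')
        proc.stdin.close()
        return proc.wait()


# Made-up driver path, quoted the way the real program needs it.
outfile = tempfile.NamedTemporaryFile(suffix='.txt', delete=False)
outfile.close()
exit_code = run_interactive('"C:/some/driver.csv"', outfile.name)
with open(outfile.name) as f:
    echoed = f.read()
print(exit_code, echoed)
```

Sending stdout to a file per process keeps the output of parallel runs from getting jumbled together on one console.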

What makes this problem a little more challenging:

1) it's a vendor supplied executable we're dealing with . . . without an API or source . . . that's interactive (you can't feed it the config file path; it asks for it).  This results in a number of time.sleep() and <process>.stdin.write() calls that can be brittle.

2) getting the processes started, as I just mentioned, is easy.  Finding out when to stop, or kill them, requires knowledge of the app and how it generates output.  I've gone for an ugly, but effective check of report file contents.

3) while waiting for the processes to finish their work, I need to know things are working and what's going on.  I've accomplished this by reporting the data files' sizes in MB.

4) the executable isn't designed for a centralized code base (typically all scripts are kept in a folder for the specific project or pit), so it only allows about 100 character columns in the file paths sent to it.  I've omitted this from my sanitized version of the code, but it made things even messier than they are below.  Also, I don't know if all Windows programs do this, but the paths need to be inside quotes - the path kept breaking on the colon (:) when not quoted.
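Items 2 and 3 boil down to two small checks on files, sketched below on their own.  The marker text, file name, and temporary folder are made up for the example; the real end-of-report text depends on the executable.

```python
import os
import tempfile

END_MARKER = 'END REPORT'   # hypothetical end-of-report text


def size_mb(path):
    """File size in whole megabytes (bytes shifted right by 20 bits)."""
    return os.stat(path).st_size >> 20


def report_finished(path, marker=END_MARKER):
    """True once the report file exists and contains the end marker."""
    if not os.path.isfile(path):
        return False
    with open(path, 'r') as f:
        return marker in f.read()


workdir = tempfile.mkdtemp()
rpt = os.path.join(workdir, 'ASSAY1.RPT')
before = report_finished(rpt)            # file not written yet
with open(rpt, 'w') as f:
    f.write('blocks interpolated ...\n')
during = report_finished(rpt)            # file exists, marker missing
with open(rpt, 'a') as f:
    f.write(END_MARKER + '\n')
after = report_finished(rpt)             # marker present
print(before, during, after, size_mb(rpt))
```

Reading the whole report on every pass is wasteful for big files, but it is simple and hard to get wrong.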

Basically, this is a fairly ugly problem and a script that requires babysitting while it runs.  That's OK; it beats the alternative (running it sequentially while watching each run).  I've tried to adhere to DRY (don't repeat yourself) as much as possible, but I suspect this could be improved upon.
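One place where the repetition could probably be trimmed is the per-assay lookup dictionary.  A rough sketch, assuming every file name follows a single pattern (mine don't quite, and the real driver names also carry the pit abbreviation); build_assays and both folder paths are invented for the example:

```python
def build_assays(names, driverdir, outdir):
    """Return {1: {...}, 2: {...}} with one entry of file paths per assay."""
    assays = {}
    for number, name in enumerate(names, start=1):
        assays[number] = {
            'name': name,
            'driver file': '{0}KDriver{1}.csv'.format(driverdir, name),
            'output': '{0}{1}output.txt'.format(outdir, name),
            'data file': '{0}{1}K.csv'.format(outdir, name),
            'report file': '{0}{1}.RPT'.format(outdir, name),
        }
    return assays


names = ['ASSAY{0:d}'.format(n) for n in range(1, 6)]
assays = build_assays(names, 'C:/drivers/', 'C:/work/')
print(assays[3]['report file'])
```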

The reason why I blog it is that I suspect there are other people out there who have to do the same sort of thing with their data.  It doesn't have to be a mining problem.  It can be anything that requires intensive computation across voluminous data with an executable not designed with a Python API.

Notes: 

1) I've omitted the file multirunparameters.py that's in an import statement.  It has a bunch of paths and names that are relevant to my project, but not to the reader's programming needs.

2) Python 2.7 is listed at the top of the file as "mpython."  This is the Python that our mine planning vendor ships that ties into their quite capable Python API.  The executable I call with subprocess.Popen() is a Windows executable provided by a consultant independent of the mine planning vendor.  It just makes sense to package this interpolation inside the mine planning vendor's multirun (~ batch file) framework as part of an overall working of the 3D geologic block model.  The script exits as soon as this part of the batch is complete.  I've inserted a 10 second pause at the end just to allow a quick look before it disappears.

#!C:/MineSight/x64/mpython

"""
Interpolate grades with <consultant> program
from text files.
"""


import argparse
import subprocess as subx
import os
import collections as colx

import time
from datetime import datetime as dt


# Lookup file of constants, pit names, assay names, paths, etc.
import multirunparameters as paramsx


parser = argparse.ArgumentParser()
# 4 letter argument like 'kwat'
# Feed in at command line.
parser.add_argument('pit', help='four letter, lower case pit abbreviation (kwat)', type=str)
args = parser.parse_args()
PIT = args.pit


pitdir = paramsx.PATHS[PIT]
pathx = paramsx.BASEPATH.format(pitdir)
controlfilepathx = paramsx.CONTROLFILEPATH.format(pitdir)


timestart = dt.now()
print(timestart)


PROGRAM = 'C:/MSPROJECTS/EOMReconciliation/2014/Multirun/AllPits/consultantprogram.exe'

ENDTEXT = 'END <consultant> REPORT'

# These names are the only real difference between pits.
# Double quote is for subprocess.Popen object's stdin.write method
# - Windows path breaks on colon without quotes.
ASSAY1DRIVER = 'KDriverASSAY1{:s}CBT.csv"'.format(PIT)
ASSAY2DRIVER = 'KDriverASSAY2{:s}CBT.csv"'.format(PIT)
ASSAY3DRIVER = 'KDriverASSAY3_{:s}CBT.csv"'.format(PIT)
ASSAY4DRIVER = 'KDriverASSAY4_{:s}CBT.csv"'.format(PIT)
ASSAY5DRIVER = 'KDriverASSAY5_{:s}CBT.csv"'.format(PIT)


RETCHAR = '\n'

ASSAY1 = 'ASSAY1'
ASSAY2 = 'ASSAY2'
ASSAY3 = 'ASSAY3'
ASSAY4 = 'ASSAY4'
ASSAY5 = 'ASSAY5'


NAME = 'name'
DRFILE = 'driver file'
OUTPUT = 'output'
DATFILE = 'data file'
RPTFILE = 'report file'


# data, report files
ASSAY1K = 'ASSAY1K.csv'
ASSAY1RPT = 'ASSAY1.RPT'

ASSAY2K = 'ASSAY2K.csv'
ASSAY2RPT = 'ASSAY2.RPT'

ASSAY3K = 'ASSAY3K.csv'
ASSAY3RPT = 'ASSAY3.RPT'

ASSAY4K = 'ASSAY4K.csv'
ASSAY4RPT = 'ASSAY4.RPT'

ASSAY5K = 'ASSAY5K.csv'
ASSAY5RPT = 'ASSAY5.RPT'


OUTPUTFMT = '{:s}output.txt'

ASSAYS = {1:{NAME:ASSAY1,
             DRFILE:controlfilepathx + ASSAY1DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY1),
             DATFILE:pathx + ASSAY1K,
             RPTFILE:pathx + ASSAY1RPT},
          2:{NAME:ASSAY2,
             DRFILE:controlfilepathx + ASSAY2DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY2),
             DATFILE:pathx + ASSAY2K,
             RPTFILE:pathx + ASSAY2RPT},
          3:{NAME:ASSAY3,
             DRFILE:controlfilepathx + ASSAY3DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY3),
             DATFILE:pathx + ASSAY3K,
             RPTFILE:pathx + ASSAY3RPT},
          4:{NAME:ASSAY4,
             DRFILE:controlfilepathx + ASSAY4DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY4),
             DATFILE:pathx + ASSAY4K,
             RPTFILE:pathx + ASSAY4RPT},
          5:{NAME:ASSAY5,
             DRFILE:controlfilepathx + ASSAY5DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY5),
             DATFILE:pathx + ASSAY5K,
             RPTFILE:pathx + ASSAY5RPT}}


DELFILE = 'delete file'
INTERP = 'interp'
SLEEP = 'sleep'
MSGDRIVER = 'message driver'
MSGRETCHAR = 'message return character'
FINISHED1 = 'finished one assay'
FINISHEDALL = 'finished all interpolations'
TIMEELAPSED = 'time elapsed'
FILEEXISTS = 'report file exists'
DATSIZE = 'data file size'
DONE = 'number interpolations finished'
DATFILEEXIST = 'data file not yet there'
SIZECHANGE = 'report file changed size'


# for converting to megabyte file size from os.stat()
BITSHIFT = 20

# sleeptime - 5 seconds
SLEEPTIME = 5

FINISHED = 'finished'
RPTFILECHSIZE = """
        
Report file for {:s}
changed size; killing process . . .

"""

MESGS = {DELFILE:'\n\nDeleting {} . . .\n\n',
         INTERP:'\n\nInterpolating {:s} . . .\n\n',
         SLEEP:'\nSleeping 2 seconds . . .\n\n',
         MSGDRIVER:'\n\nWriting driver file name to stdin . . .\n\n',
         MSGRETCHAR:'\n\nWriting retchar to stdin for {:s} . . .\n\n',
         FINISHED1:'\n\nFinished {:s}\n\n',
         FINISHEDALL:'\n\nFinished interpolation.\n\n',
         TIMEELAPSED:'\n\n{:d} elapsed seconds\n\n',
         FILEEXISTS:'\n\nReport file for {:s} exists . . .\n\n',
         DATSIZE:'\n\nData file size for {:s} is now {:d}MB . . .\n\n',
         DONE:'\n\n{:d} out of {:d} assays are finished . . .\n\n',
         DATFILEEXIST:"\n\n{:s} doesn't exist yet . . .\n\n",
         SIZECHANGE:RPTFILECHSIZE}


def cleanslate():
    """
    Delete all output files prior to interpolation
    so that their existence can be tracked.
    """
    for key in ASSAYS:
        files = (ASSAYS[key][DATFILE],
                 ASSAYS[key][RPTFILE],
                 ASSAYS[key][OUTPUT])
        for filex in files:
            print(MESGS[DELFILE].format(filex))
            if os.path.exists(filex) and os.path.isfile(filex):
                os.remove(filex)
    return 0


def startprocess(assay):
    """
    Start <consultant program> run for given interpolation.

    Return subprocess.Popen object,
    file object (output file).
    """
    print(MESGS[INTERP].format(ASSAYS[assay][NAME]))
    # XXX - I hate time.sleep - hack
    # XXX - try to re-route standard output so that
    #       it's not all jumbled together.
    print(MESGS[SLEEP])
    time.sleep(2)
    # output file for stdout
    f = open(ASSAYS[assay][OUTPUT], 'w')
    procx = subx.Popen('{0}'.format(PROGRAM), stdin=subx.PIPE, stdout=f)
    print(MESGS[SLEEP])
    time.sleep(2)
    # XXX - problem, starting up Excel CBT 22JUN2014
    #       Ah - this is what happens when the <software usb licence>
    #            key is not attached :-(
    print(MESGS[MSGDRIVER])
    print('\ndriver file = {:s}\n'.format(ASSAYS[assay][DRFILE]))
    procx.stdin.write(ASSAYS[assay][DRFILE])
    print(MESGS[SLEEP])
    time.sleep(2)
    # XXX - this is so jacked up -
    #       no idea what is happening when
    print(MESGS[MSGRETCHAR].format(ASSAYS[assay][NAME]))
    procx.stdin.write(RETCHAR)
    print(MESGS[SLEEP])
    time.sleep(2)
    print(MESGS[MSGRETCHAR].format(ASSAYS[assay][NAME]))
    procx.stdin.write(RETCHAR)
    print(MESGS[SLEEP])
    time.sleep(2)
    return procx, f


def crosslookup(assay):
    """
    From assay string, get numeric
    key for ASSAYS dictionary.

    Returns integer.
    """
    for key in ASSAYS:
        if assay == ASSAYS[key][NAME]:
            return key
    return 0


def checkprocess(assay, assaydict):
    """
    Check to see if assay
    interpolation is finished.

    assay is the item in question
    (ASSAY1, ASSAY2, etc.).

    assaydict is the operating dictionary
    for the assay in question.

    Returns True if finished.
    """
    # Report file indicates process finished.
    assaykey = crosslookup(assay)
    rptfile = ASSAYS[assaykey][RPTFILE]
    datfile = ASSAYS[assaykey][DATFILE]
    if os.path.exists(datfile) and os.path.isfile(datfile):
        # Report size of file in MB.
        datfilesize = os.stat(datfile).st_size >> BITSHIFT
        print(MESGS[DATSIZE].format(assay, datfilesize))
    else:
        # Doesn't exist yet.
        print(MESGS[DATFILEEXIST].format(datfile))
    if os.path.exists(rptfile) and os.path.isfile(rptfile):
        # XXX - not the most efficient way,
        #       but this checking the file appears
        #       to work best.
        f = open(rptfile, 'r')
        txt = f.read()
        f.close()
        # XXX - hack - gah.
        if txt.find(ENDTEXT) > -1:
            # end-of-report text found in the report file,
            # so the interpolation is done
            print(MESGS[SIZECHANGE].format(assay))
            print(MESGS[SLEEP])
            time.sleep(2)
            return True
    return False


PROCX = 'process'
OUTPUTFILE = 'output file'


# Keeps track of files and progress of <consultant program>.
opdict = colx.OrderedDict()


# get rid of preexisting files
cleanslate()


# start all five roughly in parallel
# ASSAYS keys are numbers
for key in ASSAYS:
    # opdict - ordered with assay names as keys
    namex = ASSAYS[key][NAME]
    opdict[namex] = {}
    assaydict = opdict[namex]
    assaydict[PROCX], assaydict[OUTPUTFILE] = startprocess(key)
    # Initialize active status of process.
    assaydict[FINISHED] = False


# For count.
numassays = len(ASSAYS)
# Loop until all finished.
while True:
    # Cycle until done then break.
    # Sleep SLEEPTIME seconds at a time and check between.
    time.sleep(SLEEPTIME)
    # Count.
    i = 0
    for key in opdict:
        assaydict = opdict[key]
        if not assaydict[FINISHED]:
            status = checkprocess(key, assaydict)
            if status:
                # kill process when report file changes
                opdict[key][PROCX].kill()
                assaydict[FINISHED] = True
                i += 1
        else:
            i += 1
    print(MESGS[DONE].format(i, numassays))
    # all done
    if i == numassays:
        break


print('\n\nFinished interpolation.\n\n')
timeend = dt.now()
elapsed = timeend - timestart


print(MESGS[TIMEELAPSED].format(elapsed.seconds))
print('\n\n{:d} elapsed minutes\n\n'.format(elapsed.seconds // 60))


# Allow quick look at screen.
time.sleep(10)



Sunday, October 12, 2014

Downloading a Bunch of MP3's off the Internet (Foreign Language Tapes)

A mining bud Jen wrote a blog post lamenting the difficulty of learning a foreign language as an adult in a far off land.  This inspired me to clean up my "download the Foreign Service Institute" French "tapes" (mp3's, actually) script I wrote for myself and publish it.

I'm not very astute on web programming.  This script came out of necessity.  There may be other, more efficient ways to do this.  If you have a slow connection a piecemeal approach will probably be required.  It took about 20 minutes to get all these files over a decent Verizon MIFI unit connection (I, unfortunately, don't have speed metrics available).

Notes about the downloaded product:  the US State Department's language tapes and lessons were mostly written and produced 30 to 50 years ago.  It's not Rosetta Stone, but I have found them to have value when it comes to practicing pronunciation, including cadence and rhythm of the foreign language - things you just can't get from printed or displayed text.

My late wife gifted me some Spanish tapes prior to the internet age that helped me out.  I am by no means fluent in Spanish, but I can say Hacemos lo que podemos hasta que nos boten (this may not be entirely grammatically correct) to the Spanish speaking mining engineers and get a laugh.




The original names of the mp3's are unnecessarily long and have the appearance of having been created by the Department of Redundancy Department.  It's a government thing, but it does not reflect on the quality of the product.  While the tapes at times are sociologically and technologically dated in their subject matter, the foreign languages haven't changed all that much.



The script:  I used Python 3.4 with the urlretrieve function from the urllib package's request module.  The main challenge was getting the url's of the mp3's right.  The names are not entirely consistent.  For help with this (I am using Firefox 24.3.0 on OpenBSD 5.4), I right clicked on the mp3's link and selected Inspect Element from the drop down menu:



The lower left window has the href and the link to the mp3 - if your script is not able to find the file, this is a convenient place to look.
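Since getting the names right is mostly a string formatting exercise, here is a small sketch of the two str.format styles the script relies on, with no downloading involved.  The host and path are placeholders, not the real course site.  The quote call at the end is an extra the script does not use; it is one option if a server objects to raw spaces in a url.

```python
from urllib.parse import quote

# Hypothetical host and path; only the formatting technique matters here.
BASE = 'http://example.org/Courses/French/Volume {volume}/'
NAME = 'Unit {unit:0>2d} {unit:0>2d}.{section:0>2d}.mp3'

record = {'volume': 1, 'unit': 3, 'section': 5}

# Keyword style: unpack the dictionary into named fields.
url = (BASE + NAME).format(**record)

# Positional style: pass the dictionary itself and index into it.
same_name = 'Unit {0[unit]:0>2d} {0[unit]:0>2d}.{0[section]:0>2d}.mp3'.format(record)

# Percent-encode the spaces while leaving ':' and '/' alone.
safe_url = quote(url, safe=':/')
print(url)
print(safe_url)
```

The 0>2d format spec pads single digit unit and section numbers with a leading zero, which is what the file names on the site expect for most chapters.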

This is the whole thing:


#!python3.4

from urllib import request

# For getting foreign language study mp3's.
# Main part of URL for French.
BASEURL = 'http://www.fsi-language-courses.org/Courses/'
MIDDLEURLI = 'French/Basic (Revised)/Volume {volume}/'
MIDDLEURLII = 'French/Basic (Revised)/Volume {0:s}/'
BASEURLEND = 'FSI - French Basic Course (Revised) '

# Format changes inexplicably at chapter 19.
# Grrrr . . .
URLI = BASEURL + MIDDLEURLI + BASEURLEND
URLI += '- Volume {volume} - Unit {unit:0>2d} '
URLI += '{unit:0>2d}.{section:0>2d}.mp3'

URLII = BASEURL + MIDDLEURLII + BASEURLEND
URLII += '- Volume {1[volume]:d} - Unit {1[unit]:0>2d} '
URLII += '{1[unit]:0>2d}.{1[section]:d}.mp3'

# Format for actual name of mp3 files.
# This is what I wanted for a name - your
# preferences may be different - adjust
# accordingly.
FILENAME = '{unit:0>2d}{section:0>2d}.mp3'

# Texts (pdf format).
# Everything the State Dept. does is a 'StudentText' -
# fair enough.
STUDENTTXT = 'StudentText.pdf'

PDFURLBASICTEXT1 = 'http://ia601400.us.archive.org/28/items/'
PDFURLBASICTEXT1 += 'Fsi-FrenchBasicCourserevised-StudentText/'
PDFURLBASICTEXT1 += 'Fsi-FrenchBasicCourserevised-Volume1-'

PDFURLBASICTEXT2 = 'http://ia801400.us.archive.org/28/items/'
PDFURLBASICTEXT2 += 'Fsi-FrenchBasicCourserevised-StudentText/'
PDFURLBASICTEXT2 += 'Fsi-FrenchBasicCourserevised-Volume2-'

PDFURLMONDEFR = 'http://ia600406.us.archive.org/3/items/'
PDFURLMONDEFR += 'Fsi-LeMondeFrancophone/Fsi-LeMondeFrancophone-'

TWO = 'Two'

# Tack on StudentText.pdf to end.
pdfs = [PDFURLBASICTEXT1, PDFURLBASICTEXT2, PDFURLMONDEFR]
pdfs = [pdfx + STUDENTTXT for pdfx in pdfs]
myfilenames = ['basictext1.pdf', 'basictext2.pdf', 'mondefrancophone.pdf']
# I'm using the dictionary keys for filenames.
pdfs = dict(zip(myfilenames, pdfs))

VOLUME = 'volume'
UNIT = 'unit'
SECTION = 'section'

# volume key, then list of two tuples of unit and
# number of sections
VOLUMES = {1:[(1, 6), (2, 6), (3, 6), (4, 7), (5, 7),
              (6, 3), (7, 11), (8, 10), (9, 11), (10, 9),
              (11, 9), (12, 4)],
           2:[(13, 8), (14, 9), (15, 10), (16, 9), (17, 11),
              (18, 7), (19, 9), (20, 8), (21, 8), (22, 7),
              (23, 8), (24, 6)]}

mp3s = []
for key in VOLUMES:
    for unitsection in VOLUMES[key]:
        for x in range(1, unitsection[1] + 1):
            mp3s.append({VOLUME:key, UNIT:unitsection[0], SECTION:x})

for mp3x in mp3s:
    # Name format change at chapter 19 :-(
    if mp3x[UNIT] > 18:
        urlx = URLII.format(TWO, mp3x)
    else:
        urlx = URLI.format(**mp3x)
    filenamex = FILENAME.format(**mp3x)
    print('Retrieving {0} . . .'.format(urlx))
    request.urlretrieve(urlx, filenamex)

# Add pdf texts at end.
for pdfx in pdfs:
    print('Retrieving {0} . . .'.format(pdfx))
    request.urlretrieve(pdfs[pdfx], pdfx)

print('Everything appears to have downloaded.')
print('Check the directory with the files to be sure.')
 
As for my French efforts, I've had better luck downloading this stuff than I have learning it.  Nonetheless, a quick message to Guido van Rossum and the other core devs:  transmettez-leur mon meilleur souvenir.

Monday, October 6, 2014

Event report: pycon.za

I managed to squeeze in a 4 day stop in Johannesburg on a recent trip that happily coincided with pycon.za.  I love pycon.us and all the other big conferences, but for value, these smaller localized cons can't be beat.

Venue:  The Campus, Bryanston

Not your average office park.  It's nicely landscaped and has a huge center beach or pitch or lawn (depending where you're from).  The buildings are all named after famous sports venues like Lemans.  The nod to us Yanks (NOT New York Yankees) in Wrigley Field was a nice touch.





Best of all, there was 100MB/day of internet for all who enter.  That's not ideal if you're wanting to watch Youtube videos, but plenty if you just want to check a speaker bio or do con-related stuff.  I thought the organizers did a great job of keeping the con inexpensive but valuable.

The catered food and drinks were really good, by my standards at least.

Apart from an unfortunate plumbing problem in the men's bathroom the second day that was quickly repaired, everything went off without a hitch.

Talks that I went to:

Ludell-Doughtie Writing Python Code to Decide an Election Keynote - he outlined the methodology and process they used during a recent (Libyan? - there was Arabic right-to-left text in the data) election.

The main take-aways for me were
  1. Use pre-written, open source software packages to standardize things, because you won't have time to roll your own or dink with inconsistent data/code formats when you are in the thick of it.
  2. It's a huge responsibility to write code for an election and manage the data, but it's a cool project.

Steve Crawford Enabling Science with the Southern African Large Telescope with Python - Doctor Crawford didn't show a lot of code in this talk, but he did outline the architecture for getting information and moving it around. The scope of the talk was way too big for code samples, but that's OK. I left feeling . . . shall we say . . . inspired . . .

My main takeaways:

Astronomy is wickedly cool and based on instrumentation, precision, and data paucity and, ironically, an overabundance of data (on average about 10GB/day, up to 50GB/day). Crawford mentioned more than once the desperate need to "catch as many photons as possible because there are so few coming in." Yeah, photons, like particles of light, just wow.

Python is used for everything where it is appropriate to use it. There are plenty of problems that don't require you to be a genius rocket scientist like Crawford:  sysadmin, data, and, perhaps most importantly, web. They're using MySQL and a web frontend to distribute data throughout the world on a daily basis to other astronomers who need it. I'm always biased toward raw data myself; it is critical, but if you can't distribute it, it's not worth much.

Good talk for me to attend.

Albert Nel - Using Python in Blender - Nel is a total joker (in a respectful, entertaining, good way), but not enough of a joker to belie a serious love and enthusiasm for both Python and Blender.

My own experience with rendering 3D stuff is a little dinking around with POV-ray.  Blender is different in that it's big on animations and honoring the laws of physics.  Writing Python to automate Blender is similar to, for lack of a better analogy, writing or recording VBA macros in Excel.

Nel did a lotto ball live demo and a Lego movie ocean demo (aside:  I *LOVE* live demos, even when they go wrong - it's one of the best parts of Open Source conferences versus say, a godawful boring company Powerpoint presentation - thank you to the Nelster for accommodating us).

My takeaways:

Blender is fun.

Allison Randal The Earth is not Flat (and Other Heresies) Keynote - a lot of times I don't relate a lot to keynotes because it's about super high level programmer craft stuff (disclaimer:  I've worked as a dev, but I'm a geologist by trade) that I can't really control or understand.

So my mind wandered as Randal gracefully moved about the stage in her pixie frame and calmly laid down her knowledge.  As a much younger man I would have been thinking, "She's so smart . . . and a very attractive individual to boot . . ."  As a curmudgeonly old fart my thoughts go more towards the "Damn - she's in perfect shape, speaks well, and knows what the hell she's talking about.  I'm SOOO jealous; why can't I be like that?"  In all seriousness, what always blows me away when I see Randal talk is the calm, matter of fact way she just presents facts and opinions without any malice or belligerence.

At one point she responded to a question by saying essentially, "Don't use AWS; use OpenStack <if you want to accomplish X>."  Amazon was one of the three top corporate sponsors of the event, but it wasn't a SPEAK TRUTH TO POWER/VIVE LA REVOLUCION kind of thing, just a "this is what I think based on what I know."

I'm glad she's with "us" (the open source community) instead of selling her soul to the commercial world (which she could do at great profit).

Takeaway (tongue-in-cheek) - my view of me vis a vis Allison Randal (I'm the guy on the right).

They say "kill your heroes."  Until I drop 40 lbs. and learn to express my ideas in a less conflict ridden manner, I am not ready to kill anything.  Sorry, Ms. Randal.  I hope this isn't too creepy, but you're going to remain the queen on my hero pedestal for a while :-\

Dr. David Mertz What I Learned About Python - and About Guido's Time Machine - from Reading the Python-Ideas Mailing List Keynote - David took an example of an idea for a sum function for lists and walked through all the considerations of sanity, performance, implementation, and ultimate rejection.

My takeaways:

  1. The idea has to be intuitive and make sense (he actually experimented with this sociologically - that was kind of cool).
  2. The implementation has to be consistent.
  3. Performance matters (a lot).
  4. 1 trumps 2 and 3.

Adrianna Pińska An Introduction to Regular Expressions in Python - Don't let the name fool you; this Polish lady speaks the Queen's English quite well.  She apologized (sort of) ahead of time saying she would talk too fast, but, really, the talk was paced just right.  I was really happy having gone to it.

My takeaways (for regex):

  1. Start with very general matches (.* for example) and work towards specific matches to gain skill and confidence.
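
A quick illustration of that general-to-specific progression, using Python's re module.  The line being matched is made up, and the three patterns are just one way the narrowing might go, not anything shown in the talk:

```python
import re

line = 'Data file size for ASSAY3 is now 512MB'

# Step 1: very general - grab everything after "for ".
general = re.search(r'for (.*)', line)

# Step 2: narrower - the name is a run of word characters.
narrower = re.search(r'for (\w+) is now (.*)', line)

# Step 3: specific - letters then digits for the name,
# digits then MB for the size.
specific = re.search(r'for ([A-Z]+\d+) is now (\d+)MB$', line)

print(general.group(1))
print(narrower.groups())
print(specific.groups())
```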

Ridhwana Khan A Journey Through the Eyes of a Newbie Female Developer - Very positive, professional talk, especially for a youngster.

(Aside: it's none of my business, but I think Ms. Khan is Muslim - she wore this really cool black-red combination outfit with a red head scarf - I borked my picture with my point and shoot camera, but I think a video of the talk is online.  Anyway, for a diversity-oriented talk, the outfit was not only cool and classy, but perfect for a South African con).

Ridhwana's talk was well structured with some humor interjected.  She started out with the most important point - that she loves coding and wants to do this for a career.  There were a number of valid points and ideas put forward - it's worth checking it out online.

My main takeaway:  IIRC not once did Ridhwana mention a Code of Conduct policy nor did she dwell on personal experiences with harassment.  Essentially, she has had a pretty good experience with colleagues thus far.  After a year with an all male crew (her excepted), she learned that prior to her arrival, firm rules had been established regarding off-color humor (basically banned) and such.  For me, this is a pretty good example of how some firm (but not excessively draconian) rules can help make programmer-land a women friendly place.  Ridhwana's point was that (at least in South African society) this is typically how relationships go anyway.  You meet someone, then after some time you get to know them better, and at that time, you can loosen up a bit more as appropriate.

Hallway track:  there were fewer than 150 people at this con IIRC, so if you wanted to talk to anyone, there was time.  People involved with the new kilometer array telescope project, people involved with the older telescopes northeast of Cape Town, speakers, Dr. Mertz, Allison Randal, a PhD in computational mathematics who specializes in computer vision, South African devs, the organizers of the conference - where else could a grunt open pit mine geologist like me have access to such luminosity?  pycon.za is pretty sweet.

Monday, September 1, 2014

PDF - Removing Pages and Inserting Nested Bookmarks

I blogged before about PyPDF2 and some initial work I had done in response to a request to get a report from Microsoft SQL Server Reporting Services into PDF format.  Since then I've had better luck with PyPDF2 using it with Python 3.4.  Seldom do I need to make any adjustments to either the PDF file or my Python code to get things to work.

Presented below is the code that is working for me now.  The basic gist of it is to strip the blank pages (conveniently SSRS dumps the report with a blank page every other page) from the SSRS PDF dump and reinsert the bookmarks in the right places in a new final document.  The report I'm doing is about 30 pages, so having bookmarks is pretty critical for presentation and usability.
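The page arithmetic is simple enough to show on its own.  A minimal sketch, assuming (as in my report) that the real content sits on the even numbered indices and the blanks on the odd ones; kept_pages and moved_to are names made up for the example and no PDF library is involved:

```python
def kept_pages(numpages):
    """Pair each kept page index in the dump with its index in the new file."""
    return [(old, old // 2) for old in range(numpages) if old % 2 == 0]


def moved_to(oldindex):
    """New page index for a bookmark that pointed at oldindex in the dump."""
    return (oldindex + 1) // 2


print(kept_pages(6))
print(moved_to(4))
```

A six page dump keeps indices 0, 2 and 4, which become pages 0, 1 and 2 of the final document.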

The approach I took was to get the bookmarks out of the PDF object model and into a nested dictionary that I could understand and work with easily.  To keep the bookmarks in the right order for presentation I used collections.OrderedDict instead of just a regular Python dictionary structure.  The code should work for any depth level of nested parent-child PDF bookmarks.  My report only goes three or four levels deep, but things can get fairly complex even at that level.
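To make the nested dictionary idea concrete without dragging in a PDF, here is a sketch that uses plain Python stand-ins:  a (title, page) tuple plays the part of a bookmark object and a list holds the children of the bookmark just before it.  The function name nest and the sample outline are invented for illustration, and the sketch assumes a well formed outline (a child list never comes first) with one bookmark per page per level.

```python
from collections import OrderedDict


def nest(outline):
    """Convert [node, [children], node, ...] into nested OrderedDicts."""
    result = OrderedDict()
    lastpage = None
    for item in outline:
        if isinstance(item, list):
            # A list holds the children of the node just before it.
            result[lastpage]['children'] = nest(item)
        else:
            title, lastpage = item
            result[lastpage] = {'name': title, 'children': OrderedDict()}
    return result


# Hypothetical outline: (title, page) tuples stand in for bookmark objects.
outline = [('Summary', 0),
           ('Pit A', 1), [('Grades', 1), ('Tonnage', 2), [('By bench', 3)]],
           ('Pit B', 4)]
tree = nest(outline)
print(list(tree))
print(tree[1]['children'][2]['children'][3]['name'])
```

Because the recursion handles each child list the same way as the top level, the depth of nesting doesn't matter.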

There are a couple artifacts of the actual report I'm doing - the name "comparisonreader" refers to the subject of the report, a comparison of accounting methods' results.  I've tried to sanitize the code where appropriate, but missed a thing or two.

It may be a bit overwrought (too much code), but it gets the job done.  Thanks for having a look.

#!C:\python34\python

"""
Strip out blank pages and keep bookmarks for
SQL Server SSRS dump of model comparison report (pdf).
"""


import PyPDF2 as pdf
import math
from collections import OrderedDict

INPUTFILE = 'SSRSdump.pdf'
OUTPUTFILE = 'Finalreport.pdf'

OBJECTKEY = '/A'
LISTKEY = '/D'


# Adobe PDF document element keys.
FULLPAGE = '/Fit'
PAGE = '/Page'
PAGES = '/Pages'
ROOT = '/Root'
KIDS = '/Kids'
TITLE = '/Title'


# Python/PDF library types.
NODE = pdf.generic.Destination
CHILD = list


ADDPAGE = 'Adding page {0:d} from SSRS dump to page {1:d} of new document . . .'

# dictionary keys
NAME = 'name'
CHILDREN = 'children'


INDENT = 4 * ' '

ADDEDBOOKMARK = 'Added bookmark {0:s} to parent bookmark {1:s} at depthlevel {2:d}.'

TOPLEVEL = 'TOPLEVEL'

def getpages(comparisonreader):
    """
    From a PDF reader object, gets the
    page numbers of the odd numbered pages
    in the old document (SSRS dump) and
    the corresponding page in the final
    document.

    Returns a generator of two tuples.
    """
    # get number of pages then get odd numbered pages
    # (even numbered indices)
    numpages = comparisonreader.getNumPages()
    return ((x, int(x/2)) for x in range(numpages) if x % 2 == 0)


def fixbookmark(bookmark):
    """
    bookmark is a PyPDF2 bookmark object.

    Side effect function that changes bookmark
    page display mode to full page.
    """
    # getObject yields a dictionary
    bookmark.getObject()[OBJECTKEY][LISTKEY][1] = pdf.generic.NameObject(FULLPAGE)
    return 0


def matchpage(page, pages):
    """
    Find index of page match.

    page is a PyPDF2 page object.
    pages is the list (PyPDF2 array) of page objects.
    Returns integer page index in new (smaller) doc.
    """
    originalpageidx = pages.index(page)
    return math.floor((originalpageidx + 1)/2)


def pagedict(bookmark, pages):
    """
    Creates page dictionary for PyPDF2 bookmark object.

    bookmark is a PDF object (dictionary).
    pages is a list of PDF page objects (dictionary).
    Returns two tuple of a dictionary and
    integer page number.
    """
    page = matchpage(bookmark[PAGE].getObject(), pages)
    title = bookmark[TITLE]
    # One bookmark per page per level.
    lookupdict = OrderedDict()
    lookupdict.update({page:{NAME:title,
                             CHILDREN:OrderedDict()}})
    return lookupdict, page


def recursivepopulater(bookmark, pages):
    """
    Fills in child nodes of bookmarks
    recursively and returns dictionary.
    """
    dictx = OrderedDict()
    for pagex in bookmark:
        if type(pagex) is NODE:
            # get page info and update dictionary with it
            lookupdict, page = pagedict(pagex, pages)
            dictx.update(lookupdict)
        elif type(pagex) is CHILD:
            newdict = OrderedDict()
            newdict.update(recursivepopulater(pagex, pages))
            dictx[page][CHILDREN].update(newdict)
    return dictx


def makenewbookmarks(pages, bookmarks):
    """
    Main function to generate bookmark dictionary:

    {page number: {name:<name>,
                   children:[<more bookmarks>]},
                   and so on.

    Returns dictionary.
    """
    dictx = OrderedDict()
    # top level bookmarks
    # it's going to go bookmark, list, bookmark, list, etc.
    for bookmark in bookmarks:
        if type(bookmark) is NODE:
            # get page info and update dictionary with it
            lookupdict, page = pagedict(bookmark, pages)
            dictx.update(lookupdict)
        elif type(bookmark) is CHILD:
            dictx[page][CHILDREN] = recursivepopulater(bookmark, pages)
    return dictx


def printbookmarkaddition(name, parentname, depthlevel):
    """
    Print notification of bookmark addition.

    Indentation based on integer depthlevel.
    name is the string name of the bookmark.
    parentname is the string name of the parent
    bookmark.

    Side effect function.
    """
    args = name, parentname, depthlevel
    indent = depthlevel * INDENT
    print(indent + ADDEDBOOKMARK.format(*args))


def dealwithbookmarks(comparisonreader, output, bookmarkdict, depthlevel, levelparent=None, parentname=None):
    """
    Fix bookmarks so that they are properly
    placed in the new document with the blank
    pages removed. Recursive side effect function.

    comparisonreader is the PDF reader object
    for the original document.

    output is the PDF writer object for the
    final document.

    bookmarkdict is a dictionary of bookmarks.

    depthlevel is the depth inside the nested
    dictionary-list structure (0 is the top).

    levelparent is the parent bookmark.

    parentname is the name of the parent bookmark.
    """
    depthlevel += 1
    for pagekeylevel in bookmarkdict:
        namelevel = bookmarkdict[pagekeylevel][NAME]
        levelparentii = output.addBookmark(namelevel, pagekeylevel, levelparent)
        if depthlevel == 0:
            parentname = TOPLEVEL
        printbookmarkaddition(namelevel, parentname, depthlevel)
        fixbookmark(levelparentii)
        # dictionary
        secondlevel = bookmarkdict[pagekeylevel][CHILDREN]
        argsx = comparisonreader, output, secondlevel, depthlevel, levelparentii, namelevel
        # Recursive call.
        dealwithbookmarks(*argsx)


def cullpages():
    """
    Fix SSRS PDF dump by removing blank
    pages.
    """
    ssrsdump = open(INPUTFILE, 'rb')
    finalreport = open(OUTPUTFILE, 'wb')
    comparisonreader = pdf.PdfFileReader(ssrsdump)
    pageindices = getpages(comparisonreader)
    output = pdf.PdfFileWriter()
    # add pages from SSRS dump to new pdf doc
    for (old, new) in pageindices:
        print(ADDPAGE.format(old, new))
        pagex = comparisonreader.getPage(old)
        output.addPage(pagex)

    # Attempt to add bookmarks from original doc
    # getOutlines yields a list of nested dictionaries and lists:
    #    outermost list - starts with parent bookmark (dictionary)
    #        inner list - starts with child bookmark (dictionary)       
    #                     and so on
    # The SSRS dump and this list have bookmarks in correct order.
    bookmarks = comparisonreader.getOutlines()
    # Get page numbers using this methodology (indirect object references)
    # http://stackoverflow.com/questions/1918420/split-a-pdf-based-on-outline
    # list of IndirectObject's of pages in order
    pages = [pagen.getObject() for pagen in
            comparisonreader.trailer[ROOT].getObject()[PAGES].getObject()[KIDS]]
    # Bookmarks.
    # Top level is list of bookmarks.
    # List goes parent bookmark (Destination object)
    #               child bookmarks (list)
    #                   and so on.
    bookmarkdict = makenewbookmarks(pages, bookmarks)
    # Initial level of -1 allows increment to 0 at start.
    dealwithbookmarks(comparisonreader, output, bookmarkdict, -1)

    print('\n\nWriting final report . . .')
    output.write(finalreport)
    finalreport.close()
    ssrsdump.close()
    print('\n\nFinished.\n\n')


if __name__ == '__main__':
    cullpages()
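
The "bookmark, list, bookmark, list" pattern that makenewbookmarks and recursivepopulater walk through can be tried out without a PDF in hand.  What follows is only a minimal sketch: plain strings stand in for the PyPDF2 bookmark objects, the per-page NAME/CHILDREN layout is simplified to title keys, and the function name nestoutline and the sample titles are made up for illustration.

```python
from collections import OrderedDict


def nestoutline(outline):
    """
    outline is a list in which a string is a bookmark
    title and a list holds the children of the title
    that comes right before it.

    Returns nested OrderedDict {title: children}.
    """
    nested = OrderedDict()
    lasttitle = None
    for item in outline:
        if isinstance(item, list):
            if lasttitle is None:
                raise ValueError('child list with no parent bookmark')
            # children always belong to the most recent title
            nested[lasttitle].update(nestoutline(item))
        else:
            nested[item] = OrderedDict()
            lasttitle = item
    return nested


# made-up outline: two top level bookmarks, one with children
sample = ['Summary', ['Tonnes', 'Grade', ['By pit']], 'Appendix']
result = nestoutline(sample)
print(list(result.keys()))
```

The point of tracking lasttitle is the same as the reuse of the page variable in the functions above: a list never stands alone, it always hangs off the bookmark that preceded it.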

Sunday, August 31, 2014

Internet Explorer 9 Save Dialog - SendKeys Last Resort

At work we use Internet Explorer 9 on Windows 7 Enterprise.  SharePoint is the favored software for filesharing inside organizational groups.  Our mine planning office is in the States; the mine operation whose data I work is in a remote, poorly connected location of the world.

Recently SharePoint was updated to a new version at the mine.  The SharePoint server configuration there no longer allows Windows Explorer view or mapping of the site to a Windows drive letter.  I've put in a trouble ticket to regain this functionality, but that may take a while if it's possible at all.  Without it, it is difficult to automate file retrieval or get more than one file at a time.

In the meantime I've been able to get the text-based files over using win32com automation in Python to run Internet Explorer and grab the innerHTML object.  innerHTML is essentially the text of the files with tags around it.  I rip out the tags, write the text to a file on my hard drive and I'm good to go.
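
The tag ripping step can be sketched with the standard library alone.  This is not the code used at work, just a minimal illustration: the class name TextOnly, the function name striptags and the sample string are all made up, and the import is the Python 3 one (on Python 2.7 the module is named HTMLParser).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags and drops the tags."""

    def __init__(self):
        # convert_charrefs turns things like &amp; back into &
        HTMLParser.__init__(self, convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def striptags(innerhtml):
    """Return the text content of an innerHTML string."""
    parser = TextOnly()
    parser.feed(innerhtml)
    parser.close()
    return ''.join(parser.chunks)


# made-up sample of what a text file might look like in innerHTML
sample = '<pre>hole_id,from,to\nDH-001,0.0,1.5 &amp; up</pre>'
text = striptags(sample)
print(text)
```

A parser is a bit more forgiving than a regular expression when the file contents themselves contain angle brackets or escaped characters.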

Binary files proved to be more difficult to download.  Shown below is a screenshot of the Internet Explorer 9 dialog box that goes by the generic name Notification Bar:

 
I googled and could not find anywhere how this thing fits into the Internet Explorer 9 Document object hierarchy.  Then I came upon this colorful exchange between Microsoft Certified MVPs from 2012 that made things a little more clear.
 
It turns out you can't access the Notification Bar programmatically per se.  What you can do is activate the specific Internet Explorer window and tab you're interested in, then send keystrokes to get where you want to, click, and download your file.
 
I'm not a web programmer nor am I a dedicated Windows programmer (I'm actually a geologist).  IEC is a small module that wraps some useful functionality - in my case identifying and clicking on the link on the SharePoint page by its text identifier:
 
# C Python 2.7
 
# Internet Explorer module.
import IEC as iec
 
import time
 
ie = iec.IEController()
 
ie.Navigate(<URL of SharePoint page>)
# Give the page time to load (7 seconds).
time.sleep(7)
# I want to download file 11.msr.
ie.ClickLink('11')
# Give 5 seconds for the Notification Bar to show up.
time.sleep(5)
 
I'm fortunate in that our mine planning vendor, MineSight, ships Python 2.7 and associated win32com packages along with their software (their APIs are written for Python).  If you don't have win32com and friends installed, they are necessary for this solution.
 
At this point I've just got to deal with that pesky Internet Explorer 9 Notification Bar.  As it turns out, SendKeys makes it doable (although neither elegant nor robust :-(   ):
 
# Activate the SharePoint page.
from win32com.client import Dispatch as dispx
shell = dispx('WScript.Shell')
shell.AppActivate(<name of IE9 tab>)
# Little pause.
time.sleep(0.5)
# Keyboard combination for the Notification Bar selection
# is ALT-N or '%n'
shell.SendKeys('%n', True)
# The Notification Bar goes to "Open" by default.
# You need to tab over to the "Save" button.
shell.SendKeys('{TAB}')
# Another little pause.
time.sleep(0.1)
# Space bar clicks on this control.
shell.SendKeys(' ', True)
 
The key combinations for accessing the Notification Bar are in Microsoft's documentation here.
 
One link showing use of SendKeys is a German site (mostly English text) here.
 
And that's pretty much it.  There's another dialog that pops up in Internet Explorer 9 after the file is downloaded.  I've been able to blow that off so far and it hasn't gotten in the way as I move to the next download.  I give these files (about 300 KB) 15 seconds to download over a slow connection.  I may have to adjust that.
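
One possible alternative to a fixed 15 second wait is to watch the downloaded file itself.  The sketch below is an assumption, not what I actually run: the function name waitfordownload and its parameters are made up, and it assumes the browser writes straight to the final file name so that an unchanged file size means the download is done.

```python
import os
import time


def waitfordownload(path, timeout=15.0, poll=0.5, settle=2):
    """
    Return True once the file at path exists, is not
    empty, and has kept the same size for settle
    consecutive checks.  Return False if timeout
    seconds pass first.
    """
    deadline = time.time() + timeout
    lastsize = -1
    stable = 0
    while time.time() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size == lastsize and size > 0:
                stable += 1
                if stable >= settle:
                    return True
            else:
                # still growing (or just appeared); start over
                stable = 0
            lastsize = size
        time.sleep(poll)
    return False
```

Called right after the SendKeys sequence with the expected path of the saved file, this returns as soon as the file stops growing rather than always burning the full wait, and a False result flags a download that needs another try.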
 
This solution is an abomination by any coding/architecture/durability standard.  Still, it's the abomination that is getting the job done for the time being.
 
Thanks for stopping by.
 
 

Friday, March 28, 2014

Editing a PDF file with Python (with a little help from PDFTKBuilder)

I'm working with a report published with SQL Server Reporting Services (SSRS).  The report is located on a remote server in Africa.  It is inconvenient for management in North America to view the report and print it from a browser (slow connection, formatting issues).  Instead, management would like a PDF file of the report to be e-mailed out to a distribution list.

This post deals with taking the PDF dumped from the SSRS web report and cleaning it up for viewing and navigation (bookmarking).  I didn't know a great deal about PDFs before working on this.  My ignorance will probably be reflected in the terminology I use and my approach.  Nonetheless, the problem was a bit more involved than I anticipated.  My intent is to put my experience out there and, if I have made things harder than necessary, get some feedback in the comments.

I think it's fair to say that SSRS is not a mature product yet, but in a Microsoft/Windows environment its usefulness trumps that.  The dump to PDF or Excel feature for reports is handy, but doesn't always yield an output format consistent with the SSRS web report.  The first problem I had was a "corrupt" PDF dump.  The file opens fine in Acrobat Reader, but doesn't behave well when one attempts to copy its contents with modifications to another file with PyPDF2 (this is just a straight copy of pages from one pdf file to another new one):

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> dumpfile = open('baddumpfromssrs.pdf', 'rb')
>>> reader = pdf.PdfFileReader(dumpfile)
>>> numpages = reader.getNumPages()
>>> numpages
54
>>> outputfile = open('testoutput.pdf', 'wb')
>>> writer = pdf.PdfFileWriter()
>>> for x in xrange(numpages):
...     writer.addPage(reader.getPage(x))
...
>>> writer.write(outputfile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 279, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 352, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 381, in _sweepIndirectReferences
    newobj = self._sweepIndirectReferences(externMap, newobj)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 372, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 1164, in getObject
    retval = readObject(self.stream, self)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 71, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 587, in readFromStream
    value = readObject(stream, pdf)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 91, in readObject
    return NumberObject.readFromStream(stream)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 257, in readFromStream
    return NumberObject(num)
ValueError: invalid literal for int() with base 10: ''
>>>


Bummer.  I had to google around until something showed up on Stackoverflow.  There's a comment on the post that suggests the use of pdftk to "un-corrupt" the file.  To avoid having to have an admin install something on my work computer, I downloaded PDFTKBuilder Portable.  This is really overkill, because I didn't need the user interface to clean up the file.  There is an App folder in the pdftk portable install that has the pdftk command line tool:

C:\blogdoc>pdftk baddumpfromssrs.pdf output good.pdf

This "worked" in terms of preparing the file to be handled with Python and PyPDF2, but not before I opened it in Adobe Reader and closed it.  I'm working in a corporate environment under Windows 7.  I double checked to see that I had the same command in the command window (I used the up arrow key to recall it to issue the command that worked).  I don't know what's going on there.  The important thing was that I could proceed with the non-corrupt file good.pdf.
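
Since the rest of the cleanup is done in Python, the pdftk step could be called from a script too.  This is a minimal sketch using the standard library's subprocess module; the function names pdftkcommand and repairpdf are made up, and it assumes pdftk is on the PATH (otherwise pass the full path to the executable in the portable App folder).

```python
import subprocess


def pdftkcommand(infile, outfile, pdftk='pdftk'):
    """
    Build the argument list for
    pdftk <infile> output <outfile>
    """
    return [pdftk, infile, 'output', outfile]


def repairpdf(infile, outfile, pdftk='pdftk'):
    """
    Run pdftk on infile.  check_call raises
    subprocess.CalledProcessError if pdftk
    exits with a non-zero code.
    """
    subprocess.check_call(pdftkcommand(infile, outfile, pdftk))


command = pdftkcommand('baddumpfromssrs.pdf', 'good.pdf')
print(' '.join(command))
```

Passing the arguments as a list rather than one string leaves the quoting of paths with spaces to subprocess.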

While I'm on things that weren't working, I should probably mention the Python 3/Python 2 thing.  This originally gave an error on Python 3; when I tried to reproduce the problem, it hung forever and I had to kill it with Ctrl-C:

Python 3.4.0 (v3.4.0:04f714765c13, Mar 16 2014, 19:24:06) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> inputfile = open('good.pdf', 'rb')
>>> reader = pdf.PdfFileReader(inputfile)
>>> pagex = reader.getPage(0)
>>> pagex.extractText()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2070, in extractText
    content = ContentStream(content, self.pdf)
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2153, in __init__
    self.__parseContentStream(stream)
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2173, in __parseContentStream
    operator += tok
KeyboardInterrupt


I'm a bit of a Python 3 advocate, sometimes even a zealot.  Still, pain won over my conviction and I switched to Python 2.7 where I got better results.  There is mention of this problem (with error) on StackOverflow.  A comment makes mention of replacing a couple PyPDF2 source files to make sure it runs with Python 3.  I couldn't find the link and took the expedient Python 2.7 route.

This is about where everything started to work the way it was supposed to.  Now I could get down to fixing the SSRS pdf report dump.  The first thing that needed to happen was the removal of a bunch of blank pages from the report.  On the SSRS web report they weren't there, but the PDF file had a blank page everywhere there was a page break.  Conveniently, this was every other page:

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> inputfile = open('good.pdf', 'rb')
>>> reader = pdf.PdfFileReader(inputfile)
>>> numpages = reader.getNumPages()
>>> numpages
54
>>> contentpages = (x for x in xrange(numpages) if x % 2 == 0)
>>> writer = pdf.PdfFileWriter()
>>> for n in contentpages:
...     pagex = reader.getPage(n)
...     writer.addPage(pagex)
...
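>>> # open an output file first (the name is arbitrary)
...
>>> outputfile = open('contentpages.pdf', 'wb')
>>> writer.write(outputfile)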
>>> outputfile.close()
>>> inputfile.close()
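
The bookkeeping behind dropping every other page matters later for bookmarks, because a bookmark that pointed at an old page index has to point at the new one.  Here is a minimal sketch of that mapping, assuming (as in my report) that content always sits on the even zero-based indices; the function name keptpages is made up for illustration.

```python
def keptpages(numpages):
    """
    Return a list of (old index, new index) pairs
    for the pages that survive when every odd
    zero-based index is a blank page.
    """
    return [(old, old // 2) for old in range(0, numpages, 2)]


pairs = keptpages(54)
print(len(pairs))
print(pairs[:3])
```

For the 54 page dump this gives 27 pairs, starting with (0, 0), (2, 1), (4, 2).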


Super!  Now I've got a 27 page document with content on every page.  The next thing I needed in my case was a banner or mark across each page saying, "DRAFT FORMAT."  The specific idea was that this was a sample report being circulated for comments and approval.

I didn't want super bold red text across the page, rather white text outlined in red.  Some googling paid off with a suggestion from a mailing list.  reportlab.pdfgen is the tool used for creating the file with the banner.  We'll merge it to the pages of the main report document later.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('banner.pdf')
>>> # want red outline
...
>>> c.setStrokeColor((1, 0, 0))
>>> # inside of letters should be white
...
>>> c.setFillColor((1, 1, 1))
>>> c.setLineWidth(1.0)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> # origin is at bottom, left of page
...
>>> c.drawString(2 * 72, 7 * 72, 'DRAFT FORMAT')
>>> c.save()
>>>
>>>


Great, I've got a banner.


There is a whole bunch of stuff in that code segment that I'm leaving unexplained.  Not a big surprise, but to use a Python API to edit PDFs, you need to know something about the format.  This has been a huge learning experience over the course of a day or two for me.  What helped me most is the reportlab documentation.  After copying a code snippet and seeing that it worked, I could go back there and try to figure out how it works.  This learning experience is a work in progress.  There are things you pick up right away, though.  For instance, Adobe Reader comes with 14 base fonts of which Helvetica is one.  Who knew?  Not I!

My banner isn't quite the way I want it.  It's horizontal and I would like to tilt it to 45 degrees.  Google again to the rescue.  Some kind soul has already covered it on a blog.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('banner.pdf')
>>> c.setStrokeColor((1, 0, 0))
>>> c.setFillColor((1, 1, 1))
>>> c.setLineWidth(1.0)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> c.saveState()
>>> c.translate(100, 100)
>>> c.rotate(45)
>>> c.drawCentredString(500, 100, 'DRAFT FORMAT')
>>> c.save()
>>>


Close enough.  Confession - I don't really think things through and measure with trigonometry what it will take to get placement right; I just "hack" until it looks about right.  This is a habit I should break if I continue to have to play with PDFs.
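
For the record, the trigonometry is short.  A point drawn at (x, y) after translate(tx, ty) and rotate(degrees) lands on the page at the rotated point shifted by the translation.  The sketch below only does that arithmetic; the function name pagecoords is made up, and the comparison with the page centre assumes reportlab's default A4 page (about 595 by 842 points).

```python
import math


def pagecoords(tx, ty, degrees, x, y):
    """
    Return the page position of canvas point (x, y)
    drawn after translate(tx, ty) followed by
    rotate(degrees).
    """
    theta = math.radians(degrees)
    px = tx + x * math.cos(theta) - y * math.sin(theta)
    py = ty + x * math.sin(theta) + y * math.cos(theta)
    return px, py


# the values from the banner session:
# translate(100, 100), rotate(45), drawCentredString(500, 100, ...)
anchor = pagecoords(100, 100, 45, 500, 100)
print(anchor)
```

The banner's centre point comes out at roughly (383, 524), which is up and to the right of an A4 page centre of about (298, 421).  Translating to the page centre, rotating, and then drawing the centred string at (0, 0) would avoid the guesswork.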





 
 
Now we'll merge the banner with some pages from another pdf to make a new document.  I'm going to use pages from the reportlab documentation because there's all kinds of work stuff in the pdf I generated above.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> bannerfile = open('banner.pdf', 'rb')
>>> docfile = open('docfile.pdf', 'rb')
>>> outputfile = open('newfile.pdf', 'wb')
>>> readerbanner = pdf.PdfFileReader(bannerfile)
>>> readerdoc = pdf.PdfFileReader(docfile)
>>> writernewdoc = pdf.PdfFileWriter()
>>> pagesdoc = (readerdoc.getPage(x) for x in xrange(286, 291))
>>> for pagen in pagesdoc:
...     writernewdoc.addPage(pagen)
...
>>> writernewdoc.write(outputfile)
>>> outputfile.close()
>>> docfile.close()
>>> # now merge banner to pages of new file
...
>>> opaquebannerfile = open('opaquebannerfile.pdf', 'wb')
>>> testpagefile = open('newfile.pdf', 'rb')
>>> bannerpage = readerbanner.getPage(0)
>>> readertestpages = pdf.PdfFileReader(testpagefile)
>>> writeropaquebanner = pdf.PdfFileWriter()
>>> for x in xrange(readertestpages.getNumPages()):
...     pagex = readertestpages.getPage(x)
...     pagex.mergePage(bannerpage)
...     writeropaquebanner.addPage(pagex)
...
>>> writeropaquebanner.write(opaquebannerfile)
>>> opaquebannerfile.close()
>>> bannerfile.close()
>>> testpagefile.close()
>>>


It's not perfect, but it's essentially what I wanted (my centering of the banner could be better).



What if I wanted a transparent banner to emphasize the draft nature of the content rather than that of the format?

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('transparent.pdf')
>>> c.setStrokeColor((1, 0, 0))
>>> transparentwhite = canx.Color(1, 1, 1, alpha = 0.0)
>>> c.setFillColor(transparentwhite)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> c.saveState()
>>> c.translate(100, 100)
>>> c.rotate(45)
>>> c.drawCentredString(500, 100, 'DRAFT')
>>> c.save()
>>>
>>> # merge again
...
>>> transparentbannerfile = open('transparent.pdf', 'rb')
>>> testpagefile = open('newfile.pdf', 'rb')
>>> outputfile = open('mergedtransparent.pdf', 'wb')
>>> import PyPDF2 as pdf
>>> readerbanner = pdf.PdfFileReader(transparentbannerfile)
>>> readertestpages = pdf.PdfFileReader(testpagefile)
>>> bannerpage = readerbanner.getPage(0)
>>> writeroutput = pdf.PdfFileWriter()
>>> for x in xrange(readertestpages.getNumPages()):
...     pagex = readertestpages.getPage(x)
...     pagex.mergePage(bannerpage)
...     writeroutput.addPage(pagex)
...
>>> writeroutput.write(outputfile)
>>> outputfile.close()
>>> transparentbannerfile.close()
>>> testpagefile.close()
>>>


Not beautiful (I would make the banner font edge thinner), but it is indeed transparent.

 

The transparency part is the alpha value in the code for the color transparentwhite.  There is some sample code that shows how to do this on reportlab.com's site.

The last thing I needed to deal with was bookmarks.  I had some problems initially in that, although the bookmark showed up, it ended up at the bottom of the page underneath the SSRS tables and charts I was trying to reference.  I got around this by digging into the dictionary structure of the PyPDF2 Bookmark object.  Here is the (one line) function code:

import PyPDF2 as pdf

OBJECTKEY = '/A'
LISTKEY = '/D'
FULLPAGE = '/Fit'

def fixbookmark(bookmark):
    """
    bookmark is a PyPDF2 bookmark object.

    Side effect function that changes bookmark
    page display mode to full page.
    """
    # getObject yields a dictionary
    bookmark.getObject()[OBJECTKEY][LISTKEY][1] = pdf.generic.NameObject(FULLPAGE)
    return 0


addBookmark is a method of the PyPDF2.PdfFileWriter object.  It takes a string name, a page index (zero based), and an optional parent PyPDF2 Bookmark object.  The references in my fixbookmark function "take" prior to writing the pdf to disk with the write method of the PyPDF2.PdfFileWriter object.

Mike Driscoll blogged about PyPDF2 a couple years back.  He's got a whole series on PDFs in fact (aside: the man is a pragmatic programming blogging machine).  There are good code snippets and pretty good comment threads on those posts for newbs like me.  I found that rl118.pdf doc useful for familiarizing myself with the pdf file format and constants used to reference objects within the file format.

This was a bit of an experience dump on my part.  If you've read this far, thank you for your patience and for having a look.



Tuesday, March 18, 2014

(Windows) LogParser - Install Without Admin Rights

A twitter acquaintance @zippy1981 recommended the Windows software LogParser as a replacement for MSSQL bcp for my data transfer needs.  I downloaded the msi file from Microsoft and tried to install it.  As is true with a lot of software at work, I got a message saying the software can't be installed without admin rights.

I tweeted @zippy1981 (actually Justin Dearing in "real life") back saying I couldn't install.  He suggested using 7zip to decompress the msi file.  I downloaded 7zip portable and followed the instructions and ended up with files with names like these:

LogParser_dll.B1735C0B_1CB5_4257_8281_92109AE41CE6

The names are not handy for the executable, nor will they work, but they are easy enough to decipher - an underscore stands in for the period before the extension, and the real filename is followed by a period and a long string of characters.

Here is the mini, somewhat clunky script I wrote for "fixing" the filenames (I used Python 3.3):


"""
Remove extensions from extracted
msi files.
"""


DIRX = 'C://UserPrograms//LogParserWorking//'

import os
import shutil

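filenames = []

# os.walk is a generator of
# (directory, subdirectories, files) tuples
for dirpath, dirnames, files in os.walk(DIRX):
    for filex in files:
        # keep the directory so that files in
        # subfolders can be found again below
        filenames.append((dirpath, filex))


for dirpath, filex in filenames:
    # rip off end
    # change _ to .
    print(filex)
    if '.' not in filex or '_' not in filex.rpartition('.')[0]:
        # not an extracted <name>_<extension>.<long string> file
        continue
    # reverse
    filey = filex[-1::-1]
    # strip
    dotx = filey.find('.')
    filey = filey[dotx + 1:]
    # replace underscore
    underscore = filey.find('_')
    firstpart = filey[:underscore]
    firstpart += '.'
    secondpart = filey[underscore + 1:]
    filey = firstpart + secondpart
    filey = filey[-1::-1]
    print(filey)
    shutil.move(os.path.join(dirpath, filex),
                os.path.join(dirpath, filey))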
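
The renaming rule can also be tried on its own without touching any files.  This is a minimal sketch rather than part of the script; the function name fixname is made up, and it uses str.rpartition instead of reversing the string.

```python
def fixname(extracted):
    """
    Turn a name like 'LogParser_dll.<long string>'
    into 'LogParser.dll'.

    Returns None when the name does not follow the
    <name>_<extension>.<long string> pattern.
    """
    # drop everything after the last period
    base, dot, _tail = extracted.rpartition('.')
    # the last underscore stands in for the period
    stem, underscore, ext = base.rpartition('_')
    if not dot or not underscore:
        return None
    return stem + '.' + ext


fixed = fixname('LogParser_dll.B1735C0B_1CB5_4257_8281_92109AE41CE6')
print(fixed)
```

Returning None for names that don't fit the pattern makes it easy to leave those files alone.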


And voilà - I've got LogParser without having to bother our IT people for an install.
 
I'm probably late to the party on this msi extraction concept.  Still, I thought there might be other people who are as unaware of it as I was, so I'm blogging it.  Thanks for having a look.