Friday, September 30, 2011

Biological database PublicHouse visualized with MySQL Workbench

"PublicHouse is a publicly queryable set of biological databases constructed using the BioWarehouse biological database warehousing system." In other words, it is a project that merges all kinds of biology-related databases into one central relational database encompassing tens of millions of entries.


The project was initiated and maintained by SRI's Artificial Intelligence Center, which is best known for its DARPA-related projects. SRI, or the Stanford Research Institute, is one of the world's largest contract research institutes.


PublicHouse can easily cover at least one quarter of one's data mining 'curriculum'. This is a very rough estimate, based on the time PublicHouse is saving me in one of my research projects. Whilst their SQL server is indeed quite fast (though in heavy demand at times, to put it mildly), their user frontend and web server have seen better days. On the menu of web interfaces is phpMyAdmin, which is quite wasteful with application resources to begin with.


Fig 1.a: an excerpt (ca. 8%) of the relational database tables of PublicHouse



Such a relationship map is quite useful, especially with the option of quickly navigating to the table of interest by searching for a keyword within the browser. This option becomes available when viewing the map as an SVG (5 MB uncompressed).




Click here to view the Database as png (1.9MB)
Click here to download/view the Database as svg.gz (1.9MB)
Click here to download/view the Database as svg (5MB)
Click here to download/view the Database as pdf with captions (2MB)
Click here to download/view the Database as pdf without captions (2MB)


Note: There is a minor bug in MySQL Workbench. After changing the model settings, you have to switch Model > Relationship Notation to a different notation in order to 'commit' the change, and then revert to your Relationship Notation of choice.


Crow's foot notation is used:
Fig 1.b: overview of the most important relationship symbols in crow's foot notation for relational database (RDB) tables. (Adapted from http://ironbark.bendigo.latrobe.edu.au/subjects/IS/sem12007 )






Links:

Wednesday, September 28, 2011

Austrian police-force data visualized on Google Maps (Update)

Introduction:
Recently a group named Anonymous Austria (twitter: AnonAustria) released a document containing almost 25,000 entries on policemen and women along with their addresses.
Whilst the data is questionable, and I don't necessarily approve of the way the data was made available, I am a sucker for all kinds of data.

Below you can view a plot of the distribution of Austrian policemen and women:


Data entries clearly align with cities and their suburbs. Vienna and Graz are hard to overlook, given the size of their clusters. However, the data would be clearer to interpret and of greater value when plotting the difference between the population density and the density of servicemen. That being said, I haven't found raw population density data yet.

Age heatmap:
The primary intent wasn't to capture the distribution of the police force in Austria. But it is admittedly fascinating to see how many well-serviced, semi-translucent police cars, all frozen in time, are driving around (collision-free!) on Google Maps. What would be really interesting is a heatmap of age groups, showing where the most senior policemen and women are located.
Such a map could greatly sharpen one's deductive powers on calculating the risk of getting caught in a pursuit with a senior serviceman, as the ensuing aftermath to the stolen candy from the local grocery store. Candy, so I've learned as a kid, is important in keeping blood sugar balanced - no matter the delicious means.
Read more about the age distribution in the section 'Data investigation' below. Initially the idea was to assign the color green to the age group 10-20. It turns out 10 year olds really don't like being with the police force (even a decade later!), and as such only two data entries fall into this category.



Color->weight fixed. blue:10-20yrs, green:20-30yrs, yellow:30-40yrs, amber:40-50yrs, red >60yrs;


Looking at the data, it appears that senior servicemen and women are actually present in greater numbers in suburban and/or rural regions as well as near the border. To me, this was an astonishing result. Let me know how you would interpret this.

Implementation:
The addresses have been translated into geo-coordinates, a process known as geocoding. Google offers a service for free, under the fitting moniker 'GeocodeService'. You can try this right here for the address
'Schanzgasse 14 3712 Maissau' - which is located in Austria, but that's a quest for Sherlock Google to figure out.
The URL contains a fixed callback function embedded in the Google Maps API, a token for service authentication, and the address (URL-encoded in this case).
https://maps.googleapis.com/maps/api/js/GeocodeService.Search?4sSchanzgasse%2014%3B%203712%20Maissau&7sUS&9sen-US&callback=_xdc_._jccbik&token=129739
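
The URL-encoded address part of the request above can be reproduced with JavaScript's built-in encodeURIComponent. This is only a small sketch: the variable names are illustrative, and the semicolon after the house number is taken from the URL shown above.

```javascript
// Address as it appears (decoded) in the request URL above.
var address = 'Schanzgasse 14; 3712 Maissau';

// encodeURIComponent escapes the spaces (%20) and the semicolon (%3B).
var encoded = encodeURIComponent(address);

console.log(encoded); // Schanzgasse%2014%3B%203712%20Maissau
```

The callback name and the token are specific to the Maps API session and are not something this sketch attempts to generate.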

The heatmap is simply composed of an alpha-png, with each color being assigned to a value range for one parameter - in this case the parameter is age. Much more sophisticated overlays can be created, such as isolines and isosurfaces, alongside full JavaScript interactivity.

Data investigation:

Few have actually looked at the age distribution of the published data. The data used herein is a random sample of 2765 of the roughly 25,000 entries, with no significant difference in the age groups at the 0.05 significance level. As such the sample is representative of the age-group distribution of the entire set.
The anonymized data set has been uploaded to my data-hub repository, containing useful test data sets ranging from image and video processing to web development (curation/migration of the data to the web is still ongoing).
Below is a list with the number of servicemen in decades, with 0-10 being at the very left. Unsurprisingly there are no underage servicemen in Austria (-the foreign minister will be pleased).
There are however only 2 entries for the age category 10-20, which would hint at the data being heavily skewed or biased. The origin of the data is still not known. Yet on the contrary the skewness of the data may hint at plausible sources.
Age distribution of the general Austrian population peaking at the age class 40-50yrs

Age distribution of the data set, created w. R:
plot(seq(0,100,10),c(0, 2, 214, 477, 1005, 729, 255, 36, 29, 17, 0),type="b",xlab="Age[yrs]", ylab="Frequency[entry]")

var a = [0, 2, 214, 477, 1005, 729, 255, 36, 29, 17, 0];

Expressed as percentages, the results are quicker to take in:
var total = a.reduce(function(s, n){ return s + n; }, 0); // plain arrays have no sum() method
a.map( function(i){ return (i/total*100).toFixed(2)+'%'; } )
["0.00%", "0.07%", "7.74%", "17.26%", "36.36%", "26.37%", "9.23%", "1.30%", "1.05%", "0.62%", "0.00%"]


More than a third of the data entries (36.4%) fall into the age group 40-50 yrs, and if the age group is expanded to 40-60 yrs it covers almost two thirds (62.7%) of the data set. It is unlikely that this is a list of all policemen and women in active duty. Judging from the age groups, either the list is biased due to the nature of the data source, or the list is indeed unbiased and the skewness is a result of social factors (no kidding!).

Sharing the data: 
You can share the data through these two links:
  • Showing the distribution:
    • http://lsauer.github.com/scripts_n_snips/poldistr.html
  • Showing the age heatmap
    • http://lsauer.github.com/scripts_n_snips/poldistr.html?heatmap
To embed the maps in a website you could use the following code:


Conclusion: 
Most interesting to me would be establishing connectivities based on the (currently) quite solid results of search engines such as Google. Covering the entire connectivity space would mean that n·(n-1) queries have to be issued, one for every ordered pair of entries.
For 25,000 entries that is roughly 600 million queries. At a rate of 20 queries per second, this would amount to pretty much one year (347.2 d). If multiple entities are packed into one query, which from my experience is sensible for a maximum of four entries (Google), the time can be cut down to a quarter. If moreover the queries are restricted to a single city, an intra-city network of people can be created in just a few days.
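
The estimate can be checked with a few lines of JavaScript. The sketch assumes that every entry is queried against every other entry (ordered pairs, n·(n-1)); the function names are made up for illustration.

```javascript
// Number of queries when every entry is paired with every other entry.
function pairQueries(n) {
  return n * (n - 1);
}

// Days needed for a number of queries at a given rate (queries per second).
function daysNeeded(queries, perSecond) {
  return queries / perSecond / 86400; // 86400 seconds per day
}

var q = pairQueries(25000);
console.log(q);                                  // 624975000
console.log(daysNeeded(q, 20).toFixed(1));       // 361.7
console.log(daysNeeded(6e8, 20).toFixed(1));     // 347.2 (rounded to 600 million queries)
console.log(daysNeeded(6e8, 20 * 4).toFixed(1)); // 86.8 (four entries per query)
```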


In one instance a link was provided for a male and a female officer, both of whom were living in 'Oedenburgerstrasse', except that the former lived in '7062 Burgenland' and the latter in '1210 Wien', i.e. dozens of kilometers apart. Whilst 'Oedenburgerstrasse' does indeed exist in both provinces, it may be that the data is simply wrong. The difficult part is scoring the Google result data, with the intent of arriving at a testable hypothesis. One important metric is simply the string distance between two queried entries, and afaik Google doesn't provide this yet. Much more sophisticated algorithms have been developed.

Generating such networks of people based on scored data which is mined from the web, could prove very insightful in missing children's cases, where details may have been overlooked. Details which manifested on the web in hindsight of preceding or currently emanating information, as part of the crowd-intelligence.

Edits:

  • Added the age distribution of the Austrian population from Wolfram Alpha
  • A comment and a twitter post (fatmike182) informed me that the data is supposedly from a police club, which explains the divergence of the data from the expectation value based on the general population.
  • 30/10/11: Assigned the color-weight and size of the green overlay to the same values as the other colors, for a more realistic representation.

Note:
Unfortunately, Google's TOS don't approve of data mining strategies. But Google does provide a rich service infrastructure which in some cases can remove the need for data mining altogether.


All sources have been put on github. All data is anonymized.

Monday, September 26, 2011

Ent - Plotting entropy and computing string metrics

Introduction:
Looking at a statistical plot which shows a metric such as the entropy of the contents of a file or of a large data stream/string is often the fastest way of roughly estimating what the file or string contains.

Ent is a command line utility which can take data streams from standard input (stdin), arbitrary data via pipes, and/or files, which have to be passed as the first parameters. Ent analyzes these by calculating each file's data metrics. It then plots the cumulative metrics which result from sequentially concatenating the data (e.g. the files, hence the file order in the console parameters matters).
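
For readers unfamiliar with the metric, here is a minimal sketch of the Shannon entropy of a string. It is not taken from Ent's C# source; it is the textbook formula, with an optional base parameter comparable in spirit to the '-b' option described below.

```javascript
// Shannon entropy of a string. `base` selects the unit
// (2 = bits per symbol); this is a sketch, not Ent's code.
function entropy(str, base) {
  base = base || 2;
  var counts = {};
  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    counts[ch] = (counts[ch] || 0) + 1;
  }
  var h = 0;
  for (var key in counts) {
    if (Object.prototype.hasOwnProperty.call(counts, key)) {
      var p = counts[key] / str.length; // relative frequency of the symbol
      h -= p * Math.log(p) / Math.log(base);
    }
  }
  return h;
}

console.log(entropy('aaaa').toFixed(3));    // 0.000 (one symbol, no information)
console.log(entropy('abab').toFixed(3));    // 1.000 (two equally likely symbols)
console.log(entropy('abcd', 4).toFixed(3)); // 1.000 (four symbols, base 4)
```

Plotting this value for consecutive chunks of a file gives the kind of curve the screenshots below show: plain text sits well below the maximum, compressed or encrypted data close to it.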

Ent with a text file containing ASCII graphics. Regions with text and those with graphics can be discerned.

Ent using several files. The plot encompasses the cumulative data.


When no parameters are passed, or when '-h' or '?' is passed, the following command line usage will be displayed:


╬===================================================================╬
| Ent v09/2011 Entropy calculator & string metrics; lsauer - univie |
╬===================================================================╬


Usage: shantropy [<filename1> <filename2>...first params!] [-o <outfile>] [-h help]
[-e efficiency] [-m 1,2.. 1st,2nd order markov] [-b base-alphabet]
[-w width plot] [-h height plot] [-s <string> as last param!]



Parameters:
-e plot the efficiency of the data
-w <int> sets the width of the plot
-h <int> sets the height of the plot
-o outfilename ...will plot data to a given file and create or append to the file
use '>' myfilename.out to capture the entire stdout
-e ...plot the efficiency
-m 1/n.. first / n order markov source (int: n...number of linked characters)
-b <decimal> - "b-ary entropy", choose a different base, default is 256 for ASCII; use 64 for text

The most up to date information on usage is best accessed via the program's source header:
https://github.com/lsauer/entropy/blob/master/shantropy/Main.cs

Usage examples:

  • ent explain.nfo markdownsharp-20100703-v113.7z -b 3,6
  • cat myfile | grep '\d+.+?\d' | ent  -b 2.15 -s
  • ent .\json_testfiles.tar .\cc_by_sa.png .\vmdscene.bmp -e -w 80 -h 12 -m 1
  • ent json_testfiles.tar cc_by_sa.png vmdscene.bmp -e -w 80 -h 12 -m 1


About:

The purpose of this command line tool was to quickly analyze sequence files as well as serialized JSON data (at the time of writing the application). Since the command line along with autocompletion (by pressing Tab) is often faster than graphical tools and has virtually no startup time, it is my preferred choice for analyzing data.

The tool was written in C#, and comes with cross-platform build files along with project files for MonoDevelop.
Ent runs on Linux, Mac OS, Android and Windows.
See Mono for more information.

Notes:
The code is in disarray and will likely be rewritten, after feature completeness is reached. Consider the tool an early alpha build, far from the intended feature set. In bioinformatics many situations arise in which the command line is a very powerful and helpful way of quickly interacting with data, without the need of getting to know complex user interfaces with unfamiliar design.

Ent is able to quickly provide insight into the nature of a data stream or text-based file, e.g. structure, even hierarchy, or compression/encryption, as one becomes familiar with certain patterns.

The project can be downloaded or forked here:
https://github.com/lsauer/entropy

Installing:
In order to access the program anywhere on the command line you have to add its directory to $PATH (Linux) or PATH (Windows).

Download:



Next version:  

  • png plot output
  • grace plotting script output (end of  '11)
  • PE (portable executable) header inspection w. FPU opcode density (based on byte stream from D8 to DF - windows!) (eventually)
  • Simpson's D and E (Diversity index and Evenness) (end of the month)
  • Smith–Waterman Distance (soon)
  • Hamming distance (soon)
  • eventually Soundex
  • automatic text-mode toggling based on a score (alphabet size, file extension and text characteristics) 
  • option of data range selection (soon)

Resources:


I conclude with a plot of the Augustus library for Chlamydomonas reinhardtii, which is notoriously highly entropic and additionally non-redundant. You can get the file here.

0,50 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,45 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,40 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,35 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,30 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,25 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,20 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,15 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,10 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,05 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,00 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
------------------------------------------------------------------
     0%        16%        33%        50%        66%        83%

Sunday, September 25, 2011

javascript: binary to int, hex, decimal - quick number conversion with one line of code

Here is a code snippet for converting a number to its binary notation:

JavaScript's toString function takes the base (radix) into which the number should be converted... this makes the JavaScript console a great choice for tinkering with binary operations.
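
A few one-liners of this kind, using only the built-in Number.prototype.toString and parseInt (the value 42 is just an example):

```javascript
var n = 42;

console.log(n.toString(2));          // "101010"  decimal -> binary
console.log(n.toString(16));         // "2a"      decimal -> hex
console.log(parseInt('101010', 2));  // 42        binary  -> decimal
console.log(parseInt('2a', 16));     // 42        hex     -> decimal

// binary -> hex in one line: parse with radix 2, print with radix 16
console.log(parseInt('101010', 2).toString(16)); // "2a"
```

When calling toString on a number literal, wrap it in parentheses, e.g. (42).toString(2), since the parser would otherwise read the dot as a decimal point.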

Friday, September 16, 2011

KeyBoarder - making dynamic keyboard shortcut rendering appealing and interactive


Update: The KeyBoarder project website is at http://lsauer.github.com/KeyBoarder . All resources of this site are freely licensed under CC-BY-SA 3.



KeyBoarder automatically converts textual references to keyboard shortcuts into graphical representations. The JavaScript is fast, small and has no other dependencies (such as jQuery). The idea behind KeyBoarder is that updating past or new documents with static markup clutter is tedious, to put it mildly. Imagine updating hundreds of blog posts and tens of HTML manpages.

KeyBoarder provides event binding of individual keys and styling via CSS. Once it is loaded, the user can navigate documents by shortcut simply by pressing the corresponding keys.

KeyBoarder may be ideally suited for blogs, documents and help files which contain lots of keyboard shortcuts. Only two lines of code are required, one for importing the css-file containing the styles, and the other for importing the JavaScript into the HTML document.



KeyBoarder takes as configuration argument the CSS class name of the HTML elements holding the content of the main article or document with key shortcuts (e.g. a blog post). Often such a class is aptly called 'content', and as such it is part of the default parameters.

If you press a key which is part of a shortcut, KeyBoarder will highlight and focus the key-shortcut, and upon each further press will go to the next consecutive element that contains the pressed key in the key-shortcut.
This is particularly suited for navigating long texts, like a blog or a help-file / html-manpage. Keys can be bound to other events, and rendered individually based on per-key-type css-rules .

From the design perspective, KeyBoarder features several appealing styles to choose from:
BASE
MODERN
Fancy
Dark
Light
Simple

KeyBoarder uses the kbd element of html, which allows device specific native treatment by the browser of the client.

In order to be recognized, keys must conform to certain conventions: they must be written in UPPERCASE or Startcase and concatenated by a symbol, e.g. '+', '-' or '~', unless otherwise specified.
A minimum number of consecutive keys can be configured, so that shortcut rendering only takes place once that number is met, e.g. rendering ALT+X but not ALT.
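
To illustrate the convention, the rules above can be approximated with a regular expression. This is a hypothetical helper and not KeyBoarder's actual implementation; the function name findShortcuts and its parameters are assumptions made for this sketch.

```javascript
// Hypothetical sketch: find key shortcuts written in UPPERCASE or
// Startcase and joined by a concatenator symbol (default '+').
function findShortcuts(text, concatenator, minKeys) {
  concatenator = concatenator || '+';
  minKeys = minKeys || 2;
  var key = '(?:[A-Z][a-z]+|[A-Z0-9]+)';      // Startcase or UPPERCASE key
  var sep = '\\s*\\' + concatenator + '\\s*'; // escaped symbol, optional spaces
  var re = new RegExp(
    '\\b' + key + '(?:' + sep + key + '){' + (minKeys - 1) + ',}\\b', 'g');
  return text.match(re) || [];
}

console.log(findShortcuts('This is a simple shortcut ALT+WIN+U'));
// [ 'ALT+WIN+U' ]
console.log(findShortcuts('Ctrl+Shift+P Superscript.'));
// [ 'Ctrl+Shift+P' ]
console.log(findShortcuts('press ALT to continue'));
// [] (a single key is below the minimum of two)
```

A plain pattern like this knows nothing about valid key names, so a misspelled key such as STG would still match; checking against a key list would be a separate step.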

Keys can be styled with CSS, and several default designs are included.

Examples for writing Key-shortcuts which KeyBoarder recognizes:
This is a simple shortcut ALT+WIN+U 
ALT + WIN+ P /*this is fine*/!
somewhere in the text, ALT + RETURN - a key shortcut!
...in order to enter the text press, STRG+ ENTER
you can set different styles in preformatted text
For demonstration purposes, provided here is an excerpt of the shortcut key table for Open Office Writer:

Shortcut Keys       Effect
Ctrl+A              Select All.
Ctrl+J              Justify.
Ctrl+D              Double Underline.
Ctrl+E              Centred.
Ctrl+F              Find and Replace.
Ctrl+Shift+P        Superscript.
Ctrl+L              Align Left.
Ctrl+R              Align Right.
Ctrl+Shift+B        Subscript.
Ctrl+Y              Redo last action.
Ctrl+0 (zero)       Apply Default paragraph style.
Ctrl+1              Apply Heading 1 paragraph style.
Ctrl+2              Apply Heading 2 paragraph style.
Ctrl+3              Apply Heading 3 paragraph style.
Ctrl+5              1.5 Line Spacing.
Ctrl+Enter          Manual page break.
Ctrl+Shift+Enter    Column break in multi-columnar texts.
Alt+Enter           Inserting a new paragraph without numbering.
Alt+Enter           Inserting a new paragraph directly before or after a section or a table.

Keys can be nested:
STRG - ALT - DEL //Single key recognition mode is set, thus the last key 'Del' is recognized but not the others, as the current concatenator symbol is set to '+'
STRG+ WIN+O
STRG+P+D WIN T , or RETURN X should not be rendered
CTRL+ALT+O


In the following examples, keys will not be rendered due to varying issues (with KeyBoarder v.7 most are rendered):

STG+ALT //misspelling

STRG+OI //missing concatenator
ALT+G�Z //illegal symbol: non recognized concatenator
"ALT + DEL"
CTRL+ALT - DEL
,ALT+DEL",

Tuesday, September 13, 2011

Proxy-Painting - Visitor location maps as a canvas or Pacman's world tour

Introduction
There are many web services out there which provide small code snippets to let you track the visitors to your site, and in turn these services track you and your visitors, pouring bits into their own data mine... but that's a different story.
The tracking relies on lookup tables which link IP domains, or more often IP subnet ranges, to locations. This has been working well for almost a decade, since the advent of comprehensive lists which link a location to various IP addresses. With the arrival of mobile internet and crowdsourcing, and with great interest (entailing big $$), the detail of location assignment to IP addresses has reached a resolution at the street-block level. The general approach is called geotracking or geotargeting.
Recently a geolocation API was standardized, allowing browsers (which translates to IP-addressed computer terminals) to reveal their location on a per-request opt-in basis.

The service is part of the navigator object and accessible via JavaScript as:

navigator.geolocation

Geolocation
  1. __proto__: Geolocation
    1. clearWatch: function clearWatch() { [native code] }
    2. constructor: function Geolocation() { [native code] }
    3. getCurrentPosition: function getCurrentPosition() { [native code] }
    4. watchPosition: function watchPosition() { [native code] }
    5. __proto__: Object


getCurrentPosition takes two functions: the first is called with the positional information in the form of one parameter, and the second is called when an error occurs

navigator.geolocation.getCurrentPosition(fnPos, fnErr)

As such, entering the following command in your JavaScript console will initiate geotracking in the browser, with a permission request turning up in the navigation bar:

navigator.geolocation.getCurrentPosition(console.log, console.error)



The passed object, for instance called 'pos', has the properties
pos.timestamp
pos.coords.longitude
pos.coords.latitude
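
A minimal sketch of the two callbacks, assuming the standard Geolocation API properties (coords.latitude, coords.longitude, timestamp). The names fnPos and fnErr follow the signature shown above; the output format is arbitrary.

```javascript
// Success callback: receives a position object with `coords` and `timestamp`.
function fnPos(pos) {
  var line = 'lat ' + pos.coords.latitude.toFixed(4) +
             ', lon ' + pos.coords.longitude.toFixed(4) +
             ' @ ' + new Date(pos.timestamp).toISOString();
  console.log(line);
  return line;
}

// Error callback: receives an error object with `code` and `message`,
// e.g. when the user denies the permission request.
function fnErr(err) {
  console.error('geolocation failed (' + err.code + '): ' + err.message);
}

// Only runs where the API is available (i.e. in a browser).
if (typeof navigator !== 'undefined' && navigator.geolocation) {
  navigator.geolocation.getCurrentPosition(fnPos, fnErr);
}
```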


Proxy-Painting
The interesting idea is that in principle these ip2location tables can be used in reverse, by routing dummy requests through proxies based on a given location. One could then write all sorts of stuff (Pacman comes to mind) onto these maps, provided the spatial resolution of the service is high enough. A sufficiently populated area which has enough proxies present could be re-purposed into a canvas. Since so far I couldn't find anything similar, I would call it a proof of principle.

The required resolution limits the choice of services somewhat to http://phpweby.com/service/visitormap/ , except that it does not offer heatmap coloration, thus rendering the service useless for this purpose.

There are a few requirements though:

  • mediocre site traffic (i.e. not amazon)
  • a homogeneously IP-address-populated area
  • sufficient number of proxies (/zombies?)
  • low resolution graphics on the order of 10x10pixels (depending on the visitor-location-map service)
  • coloration of 'visitor-pins' on the map based on a heatmap or other metric which is accessible for manipulation


#constants
There are a few points to determine first:

  • target: a site with mediocre site-traffic
  • offset: an initial well populated location (lat/long)  - as so often Las Vegas might be a good start
  • graphic: a low-res icon or word(s)
  • ip2location: a lookup table providing fast translation of a location to an IP-address
  • proxy-list: a comprehensive up-to-date list of computer terminals which can be used as proxies
  • proxyip2location: the resulting list when filtering the ip2location list by the proxy-list 
  • granularity: the longitude-step and latitude-step which can be found by the following method
The proxyip2location table is translated (from spherical coordinates) into cartesian x (column = longitude), y (rows = latitude) and read into a matrix, thus spanning a world-map.

Now simple image based algorithms can be used, like a 3x3 kernel filter or a 3x3 raster for seeking the greatest density and greatest homogeneity (from a given raster).
The greatest density, for instance, could be computed by simply seeking the greatest standard deviation within a 3x3 raster.
The greatest homogeneity is the exact opposite, with the least standard deviation within a 3x3 raster.

If there are too many points in a given raster, the raster size has to be adapted, and additionally each point can be 'probed': by removing a point and recomputing the homogeneity, one can check whether the point contributes positively or negatively to the homogeneity within the raster.
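
The raster scan described above could look roughly like the following sketch. It assumes the proxyip2location table has already been binned into a matrix of counts per cell; the function names and the toy grid are made up for illustration.

```javascript
// Standard deviation of the 3x3 window whose top-left cell is (row, col).
function windowStd(m, row, col) {
  var vals = [];
  for (var r = row; r < row + 3; r++) {
    for (var c = col; c < col + 3; c++) vals.push(m[r][c]);
  }
  var mean = vals.reduce(function (s, v) { return s + v; }, 0) / vals.length;
  var variance = vals.reduce(function (s, v) {
    return s + (v - mean) * (v - mean);
  }, 0) / vals.length;
  return Math.sqrt(variance);
}

// Slide the window over the matrix; return the position with the
// largest (pickLargest = true) or smallest standard deviation.
function scan(m, pickLargest) {
  var best = null;
  for (var r = 0; r + 3 <= m.length; r++) {
    for (var c = 0; c + 3 <= m[0].length; c++) {
      var sd = windowStd(m, r, c);
      if (best === null || (pickLargest ? sd > best.sd : sd < best.sd)) {
        best = { row: r, col: c, sd: sd };
      }
    }
  }
  return best;
}

// Toy 4x5 matrix of proxies per cell (values are invented).
var grid = [
  [2, 2, 2, 0, 9],
  [2, 2, 2, 0, 0],
  [2, 2, 2, 0, 0],
  [0, 0, 0, 0, 0]
];
console.log(scan(grid, false)); // most homogeneous window: row 0, col 0, sd 0
console.log(scan(grid, true));  // least homogeneous window: row 0, col 2
```

Probing a single point then amounts to changing one cell and calling windowStd again for the windows that contain it.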

Raster positioning is another crucial point, with many ideas to choose from, greatly affecting speed and accuracy.

Taken as a whole, this method allows very easily assigning a middle-ground that fits the actual parameters of the actual location distribution of computer-end-terminals. After all life is about compromise, and so is Pacman. An end result that can be interpreted as a Pacman-on world-tour (through an artistic-lens) is a good start.


#implementation

A 10x10 black and white graphic is read into a matrix and multiplied by the canvas matrix of suitable latitude and longitude points.
Each row is processed by its own thread. Each element within the thread, and thus within each row, is translated into an IP address and routed to a proxy on the other side. (An additional optimization would be to flag proxies which perform a request at the target even with an HTTP header set to Keep-Alive: timeout=100, max=100 whilst subsequently terminating the connection.)

After each thread has looped a given number of times, e.g. 1000, it is cancelled. Once all jobs are finished, worker process initiation starts again in order to maintain synchronicity (for the sake of uniformly colored heatmap pins on a given world map).
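
The first implementation step, placing the bitmap onto the coordinate grid, can be sketched as follows. The offset and step values are placeholders, and bitmapToCoords is a name invented for this sketch; it only covers the pixel-to-coordinate mapping, not the lookup or routing.

```javascript
// Map the set pixels of a small bitmap to coordinates.
// `offset` is the top-left corner, `step` the granularity per pixel.
function bitmapToCoords(bitmap, offset, step) {
  var coords = [];
  bitmap.forEach(function (row, y) {
    row.forEach(function (pixel, x) {
      if (pixel) {
        coords.push({
          lat: offset.lat - y * step.lat, // rows run north -> south
          lon: offset.lon + x * step.lon  // columns run west -> east
        });
      }
    });
  });
  return coords;
}

// 3x3 example glyph (a plus sign); a real graphic would be 10x10.
var glyph = [
  [0, 1, 0],
  [1, 1, 1],
  [0, 1, 0]
];
var points = bitmapToCoords(glyph, { lat: 36, lon: -115 }, { lat: 0.5, lon: 0.5 });
console.log(points.length); // 5
console.log(points[0]);     // { lat: 36, lon: -114.5 }
```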

Saturday, September 10, 2011

JavaScript: Simple multi-timer for logging performance of an html page

Here is a simple timer for quickly logging performance between several crucial points of a given document, if one doesn't want the complex functionality and overhead that comes with code-profiling.
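
A minimal sketch of such a multi-timer, based on Date.now(). The object name Timer and its methods are chosen for this example only.

```javascript
// Minimal multi-timer: several named timers, each collecting lap marks.
var Timer = {
  timers: {},
  // Start (or restart) the timer with the given name.
  start: function (name) {
    this.timers[name] = { t0: Date.now(), laps: [] };
  },
  // Record a labelled lap and return the milliseconds since start().
  lap: function (name, label) {
    var t = this.timers[name];
    var ms = Date.now() - t.t0;
    t.laps.push({ label: label, ms: ms });
    return ms;
  },
  // Return all laps of a timer as "name/label: N ms" lines.
  report: function (name) {
    return this.timers[name].laps.map(function (l) {
      return name + '/' + l.label + ': ' + l.ms + ' ms';
    });
  }
};

Timer.start('page');
// ... code to be measured ...
Timer.lap('page', 'dom-ready');
// ... more code ...
Timer.lap('page', 'images-loaded');
console.log(Timer.report('page').join('\n'));
```

Date.now() has millisecond resolution, which is enough for coarse checkpoints between page sections but not for micro-benchmarks.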


Urban: School is that way...


JavaScript: document.onload vs. window.onload and multiple events

Depending on where the window.onload assignment is placed (in the head or body element, or anywhere further down), the window.onload event will shift in time and might be called twice (Chrome >v12; once in Firefox, once in Opera).

Browser behavior differs somewhat, with internal event binding seemingly slightly different between browsers. Generally it seems that if window.onload is placed within the head element, it will trigger once the DOM tree is loaded, and a second time once all content is loaded.


The general idea behind differentiating between window.onload and document.onload is that window.onload fires when the window is ready for presentation, and document.onload once the DOM tree (built from the markup code within the document) is completed.

Ideally this allows offscreen manipulations with JavaScript, incurring virtually no additional CPU load due to re-rendering. In contrast, window.onload can take a while to fire when multiple external resources have yet to be loaded, and CPU-intensive rendering could already have occurred.

I wrote the following code for investigating and timing the behavior:


Also take a look at MDN's description of window.onload:

The load event fires at the end of the document loading process. At this point, all of the objects in the document are in the DOM, and all the images and sub-frames have finished loading.

There are also Gecko-Specific DOM Events like DOMContentLoaded and DOMFrameContentLoaded (which can be handled using element.addEventListener()) which are fired after the DOM for the page has been constructed, but do not wait for other resources to finish loading.
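
The timing difference between the two events can be observed with a sketch along these lines. It is not the original test code; watchLoadEvents is a hypothetical helper that takes the document and window as parameters so that it can also be tried with stubs.

```javascript
// Log when DOMContentLoaded and load fire, relative to script start.
function watchLoadEvents(doc, win, log) {
  var t0 = Date.now();
  var seen = [];
  function handler(name) {
    return function () {
      seen.push(name);
      log(name + ' after ' + (Date.now() - t0) + ' ms');
    };
  }
  // Fires once the DOM tree is built, without waiting for images etc.
  doc.addEventListener('DOMContentLoaded', handler('DOMContentLoaded'), false);
  // Fires once all resources (images, sub-frames) have finished loading.
  win.addEventListener('load', handler('load'), false);
  return seen;
}

// In a browser: place this in the head so both events are still pending.
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
  watchLoadEvents(document, window, function (msg) { console.log(msg); });
}
```

Using addEventListener instead of assigning window.onload also avoids overwriting a handler that another script has already set.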

Wednesday, September 7, 2011

PHP: Regular expressions - avoid pattern quantifiers in the preg_match_all

preg_match is PHP's matching function for Perl-compatible regular expressions (PCRE).



It turns out that additional match quantifiers should be avoided in preg_match_all.

To demonstrate this, follow this code snippet:

The return value of preg_match_all in this case was 65, basically a count of each character of the input.



count($matches) will correctly yield '2', as empty array-elements are not counted.



Note: preg_match_all already applies the pattern repeatedly across the whole subject, which acts like an implicit quantifier on the match. As such the correct pattern will be /(....)/ not /(....)*/
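
The same effect can be reproduced in the JavaScript console, where the global flag plays the role of preg_match_all. This is only an analogue with a made-up subject string; the exact counts reported by PHP's PCRE engine for the original input may differ.

```javascript
var subject = 'abcdefghij';

// Global matching already repeats the pattern over the whole subject.
var plain = subject.match(/(....)/g);
console.log(plain);   // [ 'abcd', 'efgh' ]

// With an additional '*' the group may also match zero times, so the
// engine reports empty matches wherever four characters no longer fit.
var starred = subject.match(/(....)*/g);
console.log(starred); // [ 'abcdefgh', '', '', '' ]
```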

Tuesday, September 6, 2011

Syntaxhighlighter Mirror DOWN at alexgorbatchev.com - Script Render Fallback

With the current outage of SyntaxHighlighter at alexgorbatchev.com I searched for alternative mirrors, and realized how many people now rely solely on alexgorbatchev.com as a CDN (content delivery network) to directly load the scripts for their syntax-highlighting needs from the 'current' fork of SH. After all, the site itself promotes that use; however, it didn't live up to the quality of service one would expect.

One crucial problem which arises in the case of an outage of remote resource delivery is that with the newer method of packing scriptlets into actual script tags, nothing will be shown unless the SyntaxHighlighter script is included. There is simply no fallback!
In this newer method of scriptlet presentation the scriptlet content is embedded in an XML-conforming CDATA container within script tags. In principle most users must rely on this method if the basic convenience of editing character-rich code and XHTML conformity are to be upheld.

Fallback if no Syntaxhighlighter is present:

Note: If you just need a quick working solution, use this shAutoloader.js instead, which already includes the fallback functionality.
<script src='http://sites.google.com/site/xhr2code/shAutoloaderSite.js' type='text/javascript'></script>


An appropriate CSS class would be for instance...


I quickly came up with a FALLBACK code-snippet, which you would best include right after you invoke the Syntaxhighlighter-Render like this:

SyntaxHighlighter.HighlightAll();
or
SyntaxHighlighter.HighlightAll('code');


Here is a more readable version of the program





Just include this instead of dp.SyntaxHighlighter.HighlightAll(). You can style the fallback pre-tags by setting the variable elClass to a string containing the classname. Otherwise you can directly add different Fallback classes to the script tags containing your scriptlets.
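
As an illustration of the fallback idea, here is a minimal sketch, not the original snippet: it strips the CDATA wrapper, escapes the code and swaps each scriptlet for a pre element when no SyntaxHighlighter object is present. The function name scriptletToPre, the default class 'fallback' and the check for script type 'syntaxhighlighter' are assumptions of this sketch.

```javascript
// Strip an optional CDATA wrapper and escape the code for display in <pre>.
function scriptletToPre(source, elClass) {
  var code = source
    .replace(/^\s*<!\[CDATA\[/, '')
    .replace(/\]\]>\s*$/, '');
  var escaped = code
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  return '<pre class="' + (elClass || 'fallback') + '">' + escaped + '</pre>';
}

// Browser part: only runs when the highlighter script failed to load.
if (typeof document !== 'undefined' && typeof SyntaxHighlighter === 'undefined') {
  var scripts = document.getElementsByTagName('script');
  // Iterate backwards because the collection is live and shrinks on replace.
  for (var i = scripts.length - 1; i >= 0; i--) {
    if (scripts[i].type === 'syntaxhighlighter') {
      var holder = document.createElement('div');
      holder.innerHTML = scriptletToPre(scripts[i].text, 'fallback');
      scripts[i].parentNode.replaceChild(holder.firstChild, scripts[i]);
    }
  }
}
```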


PHP: Array_merge by value, or array merge by values and keys

PHP comes with a convenient function for merging arrays by index, called array_merge. Unfortunately there isn't a straightforward function for merging arrays by value.

Here is an implementation which works if the indexes are irrelevant. Since it relies on native functions it is still faster than a solution with scripted logic. I also discovered a curiosity/"bug" in the PHP parser:

Monday, September 5, 2011

'is' - A lightweight JavaScript-library for validations and checks

There are large web frameworks such as Dojo and Ext JS, which even at their bare essentials introduce a great deal of excess code when only a tiny subset of functions is needed. As such the rule of thumb is to not include more than one library.

I was interested in a very small library which provides a good starting point for any new script, barely introduces any overhead, and is easy to manage.

Here is the work in progress. You can download or fork it on github.
Its primary goal is to remain lightweight (~3kB) and useful.





Here are some code examples:

Friday, September 2, 2011

SOAP - WSDL: KEGG Error - Not an Array reference

SOAP - WSDL is a powerful but often dated form of transmitting data types over HTTP. It is still popular in enterprise and academic settings, as a counterpart to the increasingly popular JSON-RPC. Since SOAP is an XML-based format, the overhead incurred by validating, parsing and encapsulating data within XML is far greater than the overhead of the much simpler JavaScript Object Notation (JSON). Since JSON's data structures map closely onto the data objects used in scripts and databases, parsing has less of an impact than with XML objects. Additionally the format is much more flexible, but can just as easily be validated against a standard template.

KEGG, the Kyoto Encyclopedia of Genes and Genomes, is one of the most widely used biological databases in the world. It stores and links hierarchical information ranging from genes and functional proteins to metabolites and entire pathways. KEGG's fame is owed to its comprehensive biochemical pathway datasets and to its open, unimpeded direct access to its database resources. After all, the SOAP interface is now a decade old.

KEGG holds many samples (tested with SOAPpy, which I ported over to Python versions > 2.7. A port for Python 3.x will eventually be released.)

KEGG's rich API reference does not skimp on SOAP Python/Perl examples: http://www.genome.jp/kegg/docs/keggapi_manual.html

Most of the examples provided work as intended as long as the passed parameters are primitive types like boolean or string. However, the treatment of objects and arrays, which are nested types, differs between the various XML Schema specifications.

The following remote function from the KEGG server will demonstrate the treatment of complex types:


PHP script initially used (error handling not shown):
When using the PHP SOAP extension (.dll/.so), the remote Perl script of the KEGG SOAP server will throw an error.

On the left: Python with SOAPpy, exactly as featured in the KEGG API reference
On the right: PHP with the SOAP extension, showing that the SOAP client correctly encapsulated the data as defined in the WSDL file for the function get_pathways_by_compounds.

With the XMLSchema set to the 2001 specification:
<xsd:schema targetNamespace="SOAP/KEGG" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
the PHP SOAP Client will return the following response:
  Uncaught SoapFault exception: [SOAP-ENV:Server] Can't use string ("cpd:C00033") as an ARRAY ref while "strict refs" in use at /usr/local/WWW/pub/kegg/soap/private/v6.2/SOAP/KEGG_PATHWAY.pm line 184

When the PHP SOAP Client is set to the 1999 XMLSchema definition (xsd):
<xsd:schema targetNamespace="SOAP/KEGG" xmlns:xsd="http://www.w3.org/1999/XMLSchema">
the PHP SOAP Client will return the following response:
Uncaught SoapFault exception: [SOAP-ENV:Server] Not an ARRAY reference at /usr/local/WWW/pub/kegg/soap/private/v6.2/SOAP/KEGG_PATHWAY.pm line 184.

When the XML SOAP is set to SOAP 1.2 the PHP script quits with a Fatal error: Uncaught SoapFault exception: [SOAP-ENV:Client] Content-Type must be 'text/xml' instead of 'application/soap+xml' in...
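
The Content-Type complaint follows from how the two SOAP versions are bound to HTTP: SOAP 1.1 requests are sent as text/xml with a separate SOAPAction header, while SOAP 1.2 requests use application/soap+xml and carry the action as a parameter of the Content-Type. The KEGG endpoint evidently only accepts the former. A minimal Python sketch of that difference follows; the function name is hypothetical and the action string in the usage note is made up for illustration, not taken from the KEGG WSDL.

```python
def soap_http_headers(version, action):
    """Return the HTTP headers a client would typically send for a SOAP version."""
    if version == "1.1":
        # SOAP 1.1: generic XML media type, action in its own header.
        return {
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": '"%s"' % action,
        }
    if version == "1.2":
        # SOAP 1.2: dedicated media type, action as a Content-Type parameter.
        return {
            "Content-Type": 'application/soap+xml; charset=utf-8; action="%s"' % action,
        }
    raise ValueError("unsupported SOAP version: %r" % version)
```

Calling soap_http_headers("1.2", "SOAP/KEGG#get_pathways_by_compounds") yields a Content-Type starting with application/soap+xml, which is exactly what the server rejects.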

The transmission request for SOAP 1.2 will then look like this:


Testing with several SOAP applications didn't work either when a simple array was passed to get_pathways_by_compounds. Testing with the open source PHP library nusoap provided a crucial hint: the request contained an empty XML array, even though an array containing two elements was passed to the function get_pathways_by_compounds on different implementations of SOAP clients.
Showing the nusphere SOAP request. The clue line is highlighted. (Wireshark 1.6)


This suggested, in light of the initial investigation, that the first array gets reduced to a SOAP array container (as it should). However, the Perl implementation on the SOAP endpoint seems to require another layer of XML nesting for the parameters passed to the remote function. Indeed, by merely nesting the actual array object in another array within the PHP script, a request similar to that of SOAPpy is sent, around which the remote Perl script is built.
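
The effect of the extra nesting level can be illustrated with a small Python sketch that serializes a list into XML, once flat and once wrapped in an outer list. This is purely illustrative: the helper encode_param, the element names compound_id_list and item, and the second compound ID are assumptions, and the output is not the actual KEGG wire format.

```python
import xml.etree.ElementTree as ET


def encode_param(name, value):
    """Serialize a string or a (nested) list; each list entry becomes an <item> child."""
    elem = ET.Element(name)
    if isinstance(value, (list, tuple)):
        for entry in value:
            elem.append(encode_param("item", entry))
    else:
        elem.text = str(value)
    return elem


# "cpd:C00033" appears in the error message above; the second ID is a placeholder.
compounds = ["cpd:C00033", "cpd:C00158"]

# Flat: the list itself becomes the array container.
flat = ET.tostring(encode_param("compound_id_list", compounds), encoding="unicode")

# Nested: the list is wrapped once more, adding one layer of XML nesting.
nested = ET.tostring(encode_param("compound_id_list", [compounds]), encoding="unicode")

print(flat)
print(nested)
```

In the nested form the two compound entries sit one element deeper, which mirrors the workaround of wrapping the PHP array in another array.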
Then again, if the KEGG implementation were open source, a much larger development community would have established itself around the project. Additionally, looking at the script running at the SOAP endpoint, which triggered the error in the first place, would have let me directly understand the error in the outlined context. The solution:

Here is a cheat-sheet for the XMLSchema 2001


Conclusion: Neither the Perl scripts at the SOAP endpoint nor the SOAP libraries which the KEGG SOAP reference describes as compatible have been maintained for quite some time. The narrow range of supported SOAP libraries and the lack of open-source access to the project's implementation allowed this faulty behavior to persist for almost a decade. The WSDL specification declares its own data type 'ArrayOfString' for the function, which, after investigating the HTTP traffic, made it clear that the PHP SOAP extension was correct (as it should be, given that it is an extension distributed with the PHP package).

Apparently, some people struggled before me.

Python based web-servers and webware: Installing mod_python on Apache (Windows)

First download the following package file, which is licenced under the GPL.

There are a few steps for a successful installation:

Be aware that due to the lack of recent package releases of mod_python, an Apache 2.0.x server is required. Otherwise the shared object library (.so) cannot resolve all the addresses required by recent Apache .so interface specifications!

  • Download the register.py file from here and put it in the Python application directory that you intend to register (this will also allow using any 2.x Portable Python version). Open a command line and execute register.py
  • From the zip file, copy the apache folder into your Apache server directory, and the lib folder into your registered Python application directory.
  • Unpack setup.py and execute it (double click). When a location request pops up, select your Apache directory.
  • Add the following line to your httpd.conf (Apache/conf directory): LoadModule python_module modules/mod_python.so
  • In your ScriptAlias / CGI section of httpd.conf add:
    <Directory "<Your Document Root>/python">
       AddHandler python-program .py
       PythonHandler mptest
       PythonDebug on
    </Directory>
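  • A sample program (with the extension .py in your www directory, named mptest.py to match the PythonHandler directive above) may look like this:
    from mod_python import apache
    
    # mod_python calls handler(req) for every request and passes in the request object
    def handler(req):
        req.content_type = "text/plain"
        req.send_http_header()
        req.write("Text output via apache CGI.")
        return apache.OK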
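
Since mod_python can only be imported inside a running Apache, the logic of a handler of the shape handler(req) can be tried out beforehand with a stand-in request object. The following is a sketch under assumptions: FakeRequest and APACHE_OK are hypothetical names, the stand-in only mimics the three request members used above, and APACHE_OK assumes that apache.OK equals 0.

```python
APACHE_OK = 0  # stand-in for mod_python's apache.OK (assumed to be 0)


class FakeRequest(object):
    """Hypothetical stand-in exposing only the request members the handler uses."""

    def __init__(self):
        self.content_type = None
        self.chunks = []

    def send_http_header(self):
        # Nothing to send outside of Apache.
        pass

    def write(self, text):
        # Collect the output instead of sending it to a client.
        self.chunks.append(text)


def handler(req):
    req.content_type = "text/plain"
    req.send_http_header()
    req.write("Text output via apache CGI.")
    return APACHE_OK


req = FakeRequest()
status = handler(req)
body = "".join(req.chunks)
```

Once the handler behaves as expected, the stand-in constant is swapped back for apache.OK and the file is dropped into the configured directory.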

That's it. You can visit modpython.org for general information. The last release was almost 5 years ago.

You may also visit http://www.webwareforpython.org/ for a more recent project, which is based on Python, and includes support for many recent web technologies like JSON. A python based web server is already included. 

Showing the installation path of Webware for Python. Src:http://www.webwareforpython.org/Screenshots/ 



Thursday, September 1, 2011

Packet Sniffing to compensate for Chrome's lack of a proper Download Manager with 'advanced' download info

In Chrome, external downloads are not handled by the web inspector, which makes it difficult to get the download URL or even the referrer. This leaves one forced to use an external download manager or, as I found myself doing recently, a simple packet inspector just to get the HTTP header of a download.
I generally use NirSoft's SmartSniff whenever I am in a Windows environment and need a lightweight packet sniffer, and Kismet on Linux. When I really need it I use Wireshark (once known as Ethereal), which is quite a memory hog and therefore something I use sparingly.
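
Once the download URL is known (for example from the sniffer output), the response headers can also be re-checked from a script. The sketch below uses Python's urllib to send a HEAD request. The function names are hypothetical, and it assumes the server answers HEAD requests, which not every download host does.

```python
import urllib.request


def build_head_request(url, referer=None):
    """Prepare a HEAD request, optionally with a Referer header."""
    request = urllib.request.Request(url, method="HEAD")
    if referer:
        request.add_header("Referer", referer)
    return request


def fetch_headers(url, referer=None, timeout=10):
    """Send the HEAD request and return (url reported by urllib, header dict)."""
    request = build_head_request(url, referer)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The body is never read; only the status line and headers are used.
        return response.geturl(), dict(response.headers.items())
```

A call such as fetch_headers(download_url, referer=page_url) returns the headers (content type, content length and so on) without saving the file. It does not show what the browser itself sent, so it complements the packet sniffer rather than replacing it.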


A first choice for me when doing lightweight inspection of TCP, UDP and sometimes ICMP traffic.

Like many resource-friendly applications, smsniff relies on the old Common Controls and Common Dialogs (which themselves are based on GDI32 for rendering user dialogs)...
Applications:
In what situations do I use or need packet inspection:
  • Exposing the full transport header
  • Timing and duration of the packet traffic
  • Actual size of the traffic vs. that reported in the program
  • Looking into sources of communication where there should be none
  • Monitoring WLAN traffic for security related purposes
  • Looking into backdoor communication
  • Reverse engineering of networking applications
  • And especially DEBUGGING your own or third party networking applications
I mostly use Wireshark when I am tinkering with (binary) protocol development or establishing the initiation sequence of new (binary) protocols. In a recent case, I used Wireshark for building a protocol on top of binary WebSockets. Any situation which is a) time-consuming and b) classifiable as profound will not be met well by Kismet or SmartSniff.