Wednesday, May 30, 2012

JavaScript : Copy text to the clipboard, without Flash and securely (Updated!)

GitHub (Alexa: #348), Kickstarter (Alexa: #774), and many other major websites increasingly use a Flash-based button to enable one-click copying of text, an action that browsers otherwise disallow. As several Stack Overflow users have aptly put it, this method is 'overkill'. Basically, a translucent Flash applet is overlaid on a button graphic and performs the clipboard copy when the applet is clicked. Questionable security behavior is thus delegated to the Flash applet, just as the post-Flash era draws near. Some browsers support document.execCommand('copy'), which places content marked as editable into the clipboard. However, the method dates back to the IE 5 era, more than a decade ago, and execCommand is only loosely supported, yielding mixed results.
Jarek Milewski provides a better solution, which can be seen in action here.
Invoking a prompt dialog via window.prompt('Dialog text', refvariable); shows a native modal input dialog with the text preselected. The user can then press CTRL+C to copy and hit ENTER to dismiss the dialog.
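
As a minimal sketch of the prompt-based approach, the helper below wraps window.prompt. The name copyViaPrompt and the optional promptFn parameter (only there so that the function can be exercised outside a browser) are assumptions made for illustration; they are not taken from Jarek Milewski's solution.

```javascript
// Hypothetical helper: shows a native prompt with `text` preselected,
// so that the user can press Ctrl+C and then Enter.
function copyViaPrompt(text, promptFn) {
  // Fall back to the browser's window.prompt when no stand-in is supplied.
  var ask = promptFn || function (message, value) {
    return window.prompt(message, value);
  };
  return ask('Copy to clipboard: Ctrl+C, Enter', text);
}
```

In a page, this would typically be called from a click handler, e.g. copyViaPrompt('c:\\Windows\\System32\\restore\\MachineGuid.txt'). Nothing is written to the clipboard by the script itself; the user performs the copy.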

Demo:

Click on the file-path below to copy and paste the selected text into the file-dialog below, thus allowing quick navigation to a specific file.

► Implementation #1 (window.prompt):
c:\Windows\System32\restore\MachineGuid.txt

► Implementation #2 (preselected textbox):
c:\Windows\System32\restore\MachineGuid.txt

► Implementation #3 (direct textbox):


► Implementation #4 (using an input tag set to readonly; inspired by goo.gl):




Conclusion:
Copying cannot be straightforward, as the browser is obliged to uphold an encapsulated, sandboxed environment, with many security considerations at play. Prompting the user before performing a clipboard action is currently the best compromise: it is effective whilst being less obtrusive than other methods. Given the increasing spread of Flash-based clipboard copying, the issue should not be neglected, and alternative, fully HTML5 / DOM compliant methods should be pursued.

Monday, May 28, 2012

CSS-3 for Physics Demos : radial and linear planewave animations to potentially demonstrate the Huygens–Fresnel principle


This code example demonstrates the use of CSS3 gradient functions for CSS backgrounds; converting DOM NodeLists to Arrays via Array.prototype.slice, which is otherwise often used to copy an Array rather than create a reference to it; and using HTML5's querySelector API to select elements whose attribute value starts with a specific string.

Introduction:
A fascinating topic students encounter in their physics curriculum is the Huygens–Fresnel principle, named after Dutch physicist Christiaan Huygens, who was recently honored in the form of a name-bearing space probe. The principle states that every unobstructed point on a wavefront acts, at a given instant, as a source of outgoing secondary spherical waves [1], and it became paramount for the understanding of wave-particle duality (see: double-slit experiment).
A plane wave is a constant-frequency wave whose wavefronts (surfaces of constant phase) are infinite parallel planes of constant peak-to-peak amplitude (Wikipedia).
Refraction in the manner of Huygens. Arne Nordmann, CC-BY-SA 3


Version 3 of CSS or cascading stylesheets, provides repeated radial and linear gradient backgrounds, thus virtually lending itself to be put into a physics demo. The ease of the implementation is yet another reminder for the end of the era of Java applets.

Demo:
A WebKit-based browser such as Google Chrome or Safari is required for the demo to work, although the demo could be adapted to run in other browsers as well. At the moment, most CSS3 properties are vendor-prefixed, as they are still considered work in progress; yet prefixing hinders automatic cross-browser adoption of CSS3 properties in web pages.

Additionally we use the CSS 3 transition property for animation of the background-position:
.cssstyle {
    -webkit-transition: background-position 5s linear;
    transition: background-position 5s linear;
}

The CSS transition property takes as its first argument the property to be animated (or 'all' to animate every changing property), as its second argument the duration, and lastly the timing function, such as linear or ease-in-out. Several transitions can be given as a comma-separated list.

As input elements, sliders are used, with min, max, step and initial value declared - as shown in the Code excerpt (gist). The onchange property is set to change the backgroundPositionX of the corresponding figure, with the animation then performed by the Browser.
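
The slider wiring described above can be sketched as follows. The function name bindSlider and the 'px' unit are assumptions; the original gist may differ. The script only sets the target position, and the CSS transition then animates the change.

```javascript
// Hypothetical wiring: when the slider changes, shift the figure's
// background horizontally; the CSS transition performs the animation.
function bindSlider(slider, figure) {
  slider.onchange = function () {
    // Assumption: the slider value is interpreted as pixels.
    figure.style.backgroundPositionX = slider.value + 'px';
  };
}
```

In a browser, slider would be an input element of type range and figure the element carrying the gradient background.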









background-image: -webkit-repeating-linear-gradient(left, cyan 10%, blue 30%);








background-image: -webkit-repeating-radial-gradient(0% 50%, circle, cyan, blue 3%);





Code:
Provided below is the code for the input elements as well as the CSS rules, and the short JavaScript which serves to demonstrate the 'arrayification' of NodeLists and the application of Array and DOM functions to them. Additionally, all matching DOM elements are selected, of which the first (the step slider) is ignored, i.e. sliced out.
The gradient and code examples should be pretty self-explanatory; however, feel free to ask or make suggestions.
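
For readers without access to the gist, the 'arrayification' step can be sketched like this. The helper names and the selector in the usage note are hypothetical; only Array.prototype.slice and querySelectorAll are standard.

```javascript
// Convert any array-like object (e.g. a NodeList) into a real Array.
function toArray(listLike) {
  return Array.prototype.slice.call(listLike);
}

// Same conversion, but skipping the first entry (e.g. the step slider).
function allButFirst(listLike) {
  return Array.prototype.slice.call(listLike, 1);
}
```

In a browser, allButFirst(document.querySelectorAll('input[type=range]')) would return a plain Array of all sliders except the first, on which map, filter and forEach can then be used directly.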


With a bit of effort put into this Demo, one could potentially overlay the synchronized animations on a static image, demonstrating how an incoming planar wave hits a slit and becomes the source for a new outgoing radial wave-front. Any physics teachers interested?



[1] Atmospheric Optics, Les Cowley - DOR: 28.05.2012
[-] See WebKit-Blog Entry for CSS3 Gradients

Friday, May 25, 2012

Robust import / export of comma separated values or spreadsheets into a web-application

Described herein is a freely available library for importing and exporting any kind of separated values. As such, flat-file databases (e.g. the FASTA format) qualify as well. The only requirement is a text resource, which can be local or remote and may even be binary to some extent. In the case of an external resource, the access mode can optionally be specified, that is, whether transmission should occur synchronously or asynchronously (default).
Consider this post a follow up to my previous 'musings' on csv-parsing.

Introduction
As stated in my previous post, the one constant one may count on in the comma separated value 'business' is values which are separated. Frequently the separators are indeed commas. However, using csv as the file extension for arbitrarily separated values, as opposed to individual file extensions, allows quick recognition of spreadsheets, which are assigned a typical file-system icon in addition to file previews in modern file-system explorers.
Undoubtedly, csv provides its set of perks, which is the reason for its frequent and wide adoption, yet a large user base also entails scandalously formatted file output by some programs and scripts. Whilst programs and scripts are easily fixed, the large existing datasets derived from these tools are not. Notably, one of csv's perks is its ubiquity. Practically, this means that there are virtually no scientists who have never come into contact with this file format.

Importing crudely or badly formatted csv, separated values

Based on my previous post, a robust, highly configurable implementation is presented along with a demo, which incidentally is a small facet of an upcoming project. The JavaScript library, which may also be run server-side via Node.js, supports importing and exporting arbitrarily separated values from an ASCII text file or HTML input element, even if the values are badly and inconsistently formatted. The separator is detected automatically.

CSV can be mis-formatted irreparably, without additional metainformation. For instance,  1.2.3.4.4.523.32.4.324.432.434 , wherein the dot poses as a separator and decimal mark at the same time. Wikipedia provides a list of 'Countries using Arabic numerals with decimal point' - particularly useful to quickly narrow down possible countries of origins of a person. A way around this,  is by generally separating values with canonical smilies ;) - no, that's the wrong one.

Following are examples of badly formatted csv's

Examples of badly formatted csv's

Library features and support


  • Variable comment characters
  • Multiline/Blockcomments, heredoc and partial comments spanning one or several lines e.g. content /*..\n...*/ content2
  • Automatic variable separator detection
  • Definition of commonly encountered separators
  • type inference and reduction to primitive types, e.g. Dates, Floats, Integers, Objects, Strings etc.
  • variable nullable type and missing value imputation via a callback function
  • relatively fast (improvable), avoids speed penalizing RegEx constructs such as lookaheads  
  • crudely formatted csv, e.g. too many separators on a given line, or inconsistent quoting
  • separators may occur within quoted text fields
  • mixed quotes
  • tolerance value for the allowed number of fields extending over the detected number of fields
  • cutoff number to curb fields going over the detected number of fields (_fieldcutoff = true)
Taking all features into consideration, this makes csv-lib (interim name) one of the most robust separated-values import libraries available for web applications.
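
The type reduction and missing-value imputation listed among the features above could look roughly as follows. This is not csv-lib's code: the function name inferType, the regular expressions and the callback signature are assumptions, and Dates and Objects are left out for brevity.

```javascript
// Hypothetical sketch: reduce a raw field to a primitive type.
// `impute` is an optional callback that supplies a value for empty fields.
function inferType(field, impute) {
  var s = field.trim();
  if (s === '') {
    // Missing value: ask the callback, otherwise use null.
    return impute ? impute(field) : null;
  }
  if (/^[+-]?\d+$/.test(s)) {
    return parseInt(s, 10);            // integer
  }
  if (/^[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?$/.test(s)) {
    return parseFloat(s);              // float with a decimal point
  }
  return s;                            // everything else stays a string
}
```

Note that the sketch assumes a decimal point; values with a decimal comma would remain strings.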

Possible features:

  • Dynamic changing of the separator within the scope of one process could easily be implemented, at the cost of a few extra CPU cycles
  • Multithreaded processing through the use of web-workers (will be implemented in the first release)

Implementation Notes

Lines or individual characters can be commented-out in typical C-syntax notation. Multiline comments are denoted with
/*...
...
...
...*/
Block comments within a single line are handled through regular-expression-based deletion of the comment. Wikipedia provides an overview of common commenting notations.
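
A minimal sketch of such a regular-expression-based deletion is shown below. The function name is hypothetical and the pattern is an assumption, not necessarily the one used in csv-lib; in particular, it does not check whether a comment marker sits inside a quoted field.

```javascript
// Remove C-style block comments, whether they sit within one line
// or span several lines ([\s\S] also matches newline characters).
function stripBlockComments(text) {
  return text.replace(/\/\*[\s\S]*?\*\//g, '');
}
```

The lazy quantifier *? makes each match stop at the first closing marker, so several comments in one text are removed individually.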

Content is processed linewise by splitting at line feed and carriage return characters or a combination of the two. In principle, a non-quoted field may contain any content except for a separator character. Allowed content includes binary content, meaning content containing characters beyond or below the range of printable ASCII symbols. The separator may differ between imported csv files and is detected automatically. At least four lines must be provided for detection to succeed.
Additionally, speed matters in data import and export. Thus it is advisable to use simple regular expressions that cover the majority of cases, and complex ones only for the few exceptions. Character masking is a strategy often used to circumvent the need for complex regular expressions involving lookaheads and lookbehinds, and/or to apply such complex regular expressions conditionally to smaller chunks of data. Notably, regular expressions ultimately compile into conditional clauses executed within a regular expression engine, but they are often difficult to write in an optimal manner.
Masking works by conditionally replacing characters that impede simple solutions due to ambiguity with a character sequence that will not occur in the given text. After applying simple processing routines, or further breaking down the data, the masked sequences are returned to their original form, which incurs minimal processing cost.
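
To make the masking idea concrete, here is a small sketch for separators that occur inside double-quoted fields. The mask sequence and the function name are assumptions; any sequence that cannot occur in the data would do.

```javascript
// Assumption: this sequence never occurs in the imported text.
var MASK = '\u0000SEP\u0000';

function splitQuotedLine(line, sep) {
  // 1. Mask separators that lie inside double-quoted fields.
  var masked = line.replace(/"[^"]*"/g, function (quoted) {
    return quoted.split(sep).join(MASK);
  });
  // 2. The remaining separators are unambiguous: a plain split suffices.
  return masked.split(sep).map(function (field) {
    // 3. Restore the masked separators to their original form.
    return field.split(MASK).join(sep);
  });
}
```

Only the quoted chunks are touched by the replacement callback, so the costly part of the work scales with the amount of quoted text rather than with the whole file.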


Use csv-lib freely (MIT license) and design aspects of this project under the CC-BY-SA v3 licence, with an environmentally-friendly bug-spray handy.

Demo
Fig 1. A badly formatted csv table is imported from a textfield. Empty fields are queried from different web-resources.
Dynamically inserted values are shown in green. This type of user interface allows for on-the-fly editing, unimpeded by UI-based table configuration and table editing. csv is the most concise text-based table format, and one with which virtually all scientists are familiar.



You can check out the Compound-csv example below and type in some compound names of your own (Coming soon! -> I still need to extract the functionality and make it into a standalone demo). There are dozens of columns for chemical compounds that support auto-completion. Basically, the column names decide the kind of information that will be retrieved for a given compound-related column (e.g. logP, logD, InChIKey, InChI, Smiles, Weight or MW, Formula, TPSA (topological polar surface area), charge, volume, exactmass, monoisotopicmass, complexity, ...). Casing does not matter, and the natural ambiguity in column naming is accommodated by assigning multiple names, e.g. MW or molweight, to the same ontology term, e.g. Molecularweight.
Information retrieval in the standalone blog demo is limited to PubChem, using the new RESTful interface of the PubChem PUG API. In the actual application, the columns are linked to data providers that supply key information such as solubility in a given bulk solvent, as frequently required in separation sciences, especially when dealing with groups of compounds as heterogeneous as metabolites.
Herein lies the true power of the interactive csv table. To auto-fill a value, just set it to null (i.e. delete it). To undo your action and restore a self-provided value, hit CTRL+Z.
From a user-accessibility perspective, the system is amenable to power users and casual users alike, with equal ease and editing speed. Lastly, the demo makes use of the loosely coupled, lightweight JavaScript library is-lib, which will see its first release soon - major restructuring pending.

That's all. Have a great weekend ahead.

Thursday, May 24, 2012

Scalable web application development for metabolic biochemistry - free book excerpt

Presented herein is an excerpt of an upcoming free book which aims to provide a starting point for web-application developers or programmers who would like to venture into web-application development for the field of metabolic biochemistry and related disciplines.


Motivation for venturing into metabolic science:
An example of a metabolic disorder is diabetes. Diabetes itself is attributed to both hereditary and environmental factors and involves the sufferer having an abnormally high blood sugar level [1]. 
The rate of diabetes is projected to grow by 64 percent in the next 10 to 15 years in the US or 53.1 million diabetic Americans by 2025 [2].   
More than 240 million people worldwide now have diabetes. This will grow to more than 380 million by 2025 [3]


Introduction:

What? & Who?

The provision of the book is based on at least two authors collaborating, to summarize and abstract the key concepts of modern web-authoring and application building within the framework of biological and chemical sciences. After sufficient initial feedback, a publishable version should be ready within this year. This version will eventually transition to a web site and set a starting point for a curated, up-to-date wiki-book that allows anyone to contribute information on the rapidly expanding field of metabolic sciences related to the scope covered by the book.

Part I will provide concepts of the Web and its technologies in relation to scientific application authoring, Part II then directly steps into algorithmic concepts and user-interface aspects to discern, backend and frontend aspects that would ground a presumed web-application.

Why?

No similar literature exists at the moment, that would provide a concise reference focused at the specific and emerging scientific field of metabolic data integration, nor would it be attractive to publish, due to the potentially narrowed focus of the book. For the reader, the offered incentives are up-to-date, directly linked references, and information that pertain to their field of interest.
Primarily, the book targets audiences that are already well familiar with computer languages, but not necessarily with biological sciences. The guiding theme is the utilization of state-of-the-art technology such as sandboxed browsers and HTML5, distributed service-oriented software in the context of cloud computing, and efficient aggregation and processing of metabolic data. Naturally, web applications bring their own set of perks but also faults and shortcomings, which are discussed and, where possible, worked around in the context of the current generation of web browsers. Evidently, web applications and the large data files encountered in metabolic sciences don't mix well, yet this issue has to be addressed for web applications to become viable in the field of metabolic science. This issue is dissected in the second part of the book.

Background
With roughly 30 Omics fields, and ever expanding diversification and specialization of scientific knowledge, the book has to narrow its focus to one or two scientific subfields, in this case metabolomics, as a subdiscipline of metabolic science. Metabolomics is an emerging endeavor that arguably fits somewhere in the hierarchy of the dogma of molecular biology (see book cover). Metabolomics has achieved good results, with commercial applications found in biomarker discovery and in finding targets for genetic engineering, such as growth optimization of algae in biofuel-related green energy research, and in plant stress research to increase resilience towards stressors. On the other hand, the outlook for undergirding systems biology with hard data for the construction of bottom-up in-silico molecular models of entire biological systems is bleaker:


How?
Unlike scientific publications, all links or publication references are directly inserted at their source of citation - or hyperlinked in the case of the resulting wikibook. As a web-page, interactive examples may be provided, along with github embedded code-snippets which can be cloned to a user-account to be modified and tinkered with. The book may eventually seek crowd-funding at some point, with donations shared between the top-contributors, should it find its intended audience, to increase quality, expand code examples and improve clarity - cumulative virtues rarely found in the form of one single author.

When?
Likely, Q3 / 2012

References to up-to-date bibliography and sources are provided, and the book is completed to around 3/4.

Any prospective contributors, and interested readers who want to provide feedback are hereby cordially invited to join this effort. I am reminded every day, through potentially hundreds of google searches, just how important community knowledge is, in the information age, and how fast information can turn obsolete.

Assisting scientists in writing better scientific web tools that unburden their peers in their everyday information-rich world is part of the declared goal of this book.

Following is a short excerpt of the chapter introducing the R-programming language:

Excerpt:
Link


Bibliography
[1] SSIEM: Society for the Study of Inborn Errors of Metabolism
[2] dailyRX: Diabetes projected to grow 64 percent between 2010 and 2025
[3] United Nations Resolution 61/225: World Diabetes Day 

Monday, May 21, 2012

Sequence Viewer: FASTA to GFF - Interactively pretty printing a Protein / DNA Sequence

This blog post discusses the implementation of an arbitrary character Sequence viewer implemented in essentially one line of JavaScript code.

Introduction:
So you have a long linear character sequence of nucleotides or aminoacids -...who doesn't these days, right? ;)  An established norm for printing Protein Sequences is by splitting the sequence into blocks of ten, with several such blocks per line and sequence position numbers at the start of each new line. This formatting is for instance specified in the GCG file format.
In a web-application scenario hundreds of sequences may be retrieved from a database to be rendered at once. CSS styling of individual sequence letters would give leeway towards great outline-flexibility, but at the cost of many HTML-DOM Elements and thus Browser DOM resource usage. In other words, each HTML Element entails a comprehensive set of methods and attributes, which have to be monitored for changes in order to enable the live-update capabilities that are customary to web-pages. The more Elements added, to a given Browser Document View, the less leeway is afforded within the scope of a given project later on during web-application development.
It would be ideal to update the sequence view on-the-fly and only outline those sequences which are currently in view, or upon user interaction, such as the mouse hovering over the sequence at least once. A similar approach is seen in today's syntax highlighters, which color on-the-fly only those code keywords which are currently in view. For instance, the source-code viewer of Google's browser Chrome highlights only the code lines shown on screen.

A simple implementation to facilitate on-the-fly color outlining of sequence data is provided below, along with code comments. Interactivity is provided through the DOM.

Showing the sequence viewer with a custom CSS class applied
The entire implementation is written in one essential line of JavaScript code. First, the provided raw sequence seq is split into blocks of 10 arbitrary characters and empty elements are filtered out. Every sixth block, i.e. the first block of each line, is then prefixed with a newline and a space-padded sequence position number, and each letter is finally embedded in an HTML tag carrying the letter as its class. Newline characters are honored by setting the CSS property white-space to pre on the HTML div container.
seq.split(/(.{10})/g).filter(Boolean).map(function(e,i){return(i%6?'':'\n'+'    '.slice(0,3-(''+((i*10)+1)).length)+(i*10+1)+' ')+e.replace(/(.{1})/g,'<b class="$1">$1</b>')}).join(' ')
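
For readability, the one-liner above can be unrolled into the following sketch. It is meant to produce the same output for sequences without line breaks and with positions below 1000; the function name formatSequence is hypothetical, and match(/.{1,10}/g) is swapped in for the split-and-filter combination.

```javascript
// Unrolled sketch of the one-liner: blocks of ten letters, six blocks
// per line, each line led by a position number padded to three characters.
function formatSequence(seq) {
  var blocks = seq.match(/.{1,10}/g) || [];
  return blocks.map(function (block, i) {
    var prefix = '';
    if (i % 6 === 0) {
      // First block of a line: prepend newline and 1-based position.
      var pos = String(i * 10 + 1);
      while (pos.length < 3) {
        pos = ' ' + pos;
      }
      prefix = '\n' + pos + ' ';
    }
    // Wrap every letter in a tag whose class is the letter itself.
    return prefix + block.replace(/(.)/g, '<b class="$1">$1</b>');
  }).join(' ');
}
```

The resulting string is intended to be assigned to the innerHTML of a container styled with white-space: pre.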


Demo:



 All that is required to create a new Sequence Viewer instance, is providing the raw sequence-data as the first parameter and the element's name in Selector-Syntax as the optional second argument to seq2gff, as follows (- at the end of the HTML Document!):
var myseq2 = new seq2gff("MDCLQMVFKLFPNWKREAEVKKLVAGYKVHGDPFSTNTRRVLAVLHEKRLSYEPITVKLQTGEHKTEPFLSLNPFGQ", "#myseq-Q84TK0");

Upon hovering the mouse over the sequence, the sequence will be dynamically outlined. All outlined attributes can be styled via CSS. Each letter is assigned a CSS class; for instance, the classes .D,.E {...} decorate the negatively charged amino acids. Two CSS classes may differentiate the purines from the pyrimidines. The CSS class .sequence {...} is applied to all sequence views.

At some point or upon certain (DOM) events it may be useful to clear all markup-code (i.e. color formatting) to free Browser resources. A convenient way of achieving this goal is through DOM-querySelectors and forEach applied on Nodelists, as follows:
//get all DOM Element's with the attribute id starting with the letters 'myseq'
 var nlist = document.querySelectorAll('[id^=myseq]')
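
Continuing from the selector above, one way to clear the markup is sketched below. The function name clearMarkup and the root parameter are assumptions; re-assigning textContent replaces all child tags of an element with a single plain text node.

```javascript
// Strip the per-letter tags from all sequence views found under `root`
// (pass `document` in a browser).
function clearMarkup(root) {
  var nlist = root.querySelectorAll('[id^="myseq"]');
  // NodeLists are array-like, so Array's forEach can be borrowed.
  Array.prototype.forEach.call(nlist, function (el) {
    // Reading textContent yields the bare letters; writing it back
    // discards the child elements and with them the color formatting.
    el.textContent = el.textContent;
  });
}
```

After clearing, a view would need to be outlined again on the next hover, which the viewer is described as doing dynamically.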
For more Information see the following Code Comments of the Implementation.

Implementation - Code:
  PS: As a small plea, please share your CSS styles in the comments section via Disqus or on the code-gist page. Adjacent, non-changing letters may be joined in one HTML tag via a regular expression and back-references, but at the loss of some flexibility in terms of dynamic effects.
  Why? The mini-project came about as a demonstration of splitting strings at constant length in JavaScript, Python, R or any language with a Perl-style regular expression engine.

Friday, May 18, 2012

Regular expressions for csv, comma separated values parsing / import? - Don't! Go the 'separate' way in JavaScript

Introduction

I didn't expect to churn out another post so soon, neither did I anticipate the act of posting being the result of the complexity of regex parsing for something so simple as CSV, leading to the fallacy of writing simple regular expressions in well under three minutes and calling it a day (or night - depending on how your clock-genes tick).

Indeed few would argue the simplicity of character separated text value formats. Their prevalence, persisting throughout the decades, makes this file-format support crucial.
As a 2D data structure, csv is extremely efficient and offers at least four data types: quoted strings, NULL values (empty fields), integers and floats, and arguably references or constants. Multidimensional structures can be created if file-system structures are used to add further dimensions - not uncommon in natural science data and scientific data provision (for instance time-series datasets).

JS RegEx - A road to nowhere

Here is an example of what did not work in JavaScript (current V8 engine), including dozens of wild variations "4-stilbenol;aspirin; 1034;3434;3434;3434;34 ;34;".match(/([a-zA-z0-9#+\s _-]+[;,\t]?){3,}([a-zA-z0-9#+\s _-]+)?/g) - the variations extend to all kind of non-matching group (?:...) trials
The following expression actually works somewhat:
"4-stilbenol;aspirin; 1034;3434;3434;3434;34 ;34;".match(/(?:[a-zA-Z0-9#+\s_-]+;)(?:[a-zA-Z0-9#+\s_-]+$)?/g)
yielding
["4-stilbenol;", "aspirin;", " 1034;", "3434;", "3434;", "3434;", "34 ;", "34;"]
A Google search will bring up results (stackoverflow, kimgentes-blog) with solutions that are not applicable to roughly formatted csv. Even where they work, they are of largely hypothetical value, since they break down due to speed issues as soon as files larger than 10 MB are imported.
For comma separated values, the one constant that can be counted on, are commas that separate values.
Here is another wild RegEx example that will not help your case:
^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$

Algorithm: Fast import and automatic separator detection

Despite claims to the contrary, csv is almost a prime example of a case for string splitting over regex.

1. Split the data at \r\n, \n or \r, or use a function for linewise reading
-. common separators are ',', ';', ':', '||', '\t', '    ' i.e. a sequence of three to four spaces [ ]{3,4}
-. // Testing phase
2. For each of the separators, walk through the first four lines, and count the splits split('sep').length or the separator's frequency strcnt('sep')
3. Add the counts up with alternating signs as follows: if ((1st_line sepfreq - 2nd + 3rd - 4th ...) < tolerance) then separator = found; break; tolerance may be +/-1 on a generous day :)
-. // Load phase
4.* If text values contain the separator, simply restore simple unquoted values to full quotes and split with a quote concatenated to the separator. This is computationally less demanding than 'RegEx'ing around the issue of whether a separator lies within or outside a text value field
4. For all lines, split them with the found separator {
5. For all splits, trim them of white-space and of single and double quotes (e.g. replace(/^["']|["']$/g,''))
..and assign the split values to the lines in a 2D Array. Herein, type-casting and missing value imputation may be performed beforehand.
}
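
The steps above can be sketched in JavaScript as follows. This is a simplified illustration under stated assumptions, not the released library: the function names, the candidate list and the default tolerance of 0 are hypothetical, the absolute value of the alternating sum is compared against the tolerance, and the optional step 4.* (separators inside quoted fields) is left out.

```javascript
// Candidate separators, tested in this order (assumed list).
var SEPARATORS = [',', ';', '\t', ':', '||', '   '];

// Testing phase: alternating sum of separator counts over four lines.
function detectSeparator(lines, tolerance) {
  tolerance = tolerance || 0;
  if (lines.length < 4) return null;      // four lines are required
  var probe = lines.slice(0, 4);
  for (var s = 0; s < SEPARATORS.length; s++) {
    var sep = SEPARATORS[s];
    var counts = probe.map(function (line) {
      return line.split(sep).length - 1;  // occurrences of sep
    });
    if (counts[0] === 0) continue;        // separator does not occur
    var alt = counts.reduce(function (sum, c, i) {
      return sum + (i % 2 ? -c : c);      // 1st - 2nd + 3rd - 4th
    }, 0);
    if (Math.abs(alt) <= tolerance) return sep;
  }
  return null;
}

// Load phase: split lines and fields, trim white-space and outer quotes.
function parseCsv(text) {
  var lines = text.split(/\r\n|\n|\r/).filter(function (line) {
    return line.length > 0;
  });
  var sep = detectSeparator(lines);
  if (sep === null) return null;
  return lines.map(function (line) {
    return line.split(sep).map(function (field) {
      return field.trim().replace(/^["']|["']$/g, '');
    });
  });
}
```

The result is a 2D Array of strings; type-casting and imputation could be applied per field inside the inner map.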

Notably, the procedure takes > 1sec on a 6 year old PC in a rough C++ implementation for a 10MB csv file.

Conclusion:
The outcome is a 2d-matrix ready for any further computation. I should note that this algorithm is included in an upcoming comprehensive pre-parsing detection library, soon to be released for bioinformatic web-projects.

JavaScript Implementation
Result:


CSV Examples:

Thursday, May 17, 2012

Analysis, thoughts on Google Introducing the Knowledge Graph: things, not strings (Updated)

Yesterday Google (NASDAQ: GOOG) announced on its blog a new relational search feature to be gradually rolled out over the coming days. Whilst formally not new, it was considered novel enough to make it official, likely in full awareness of the ensuing media buzz.

2500 Mentions after a 16h window (Data provided by Topsy Inc)


This announcement involved several significant points and discoveries for me:


  1. Google overtly cleans up its code, often adopting state-of-the-art web technologies. From that perspective, the lack of a keywords tag, <meta name="keywords" content="..." />, is virtually a statement in itself to SEOs, SEO-related businesses and website administrators: meta keywords are obsolete!
    Meta keywords have the tremendous drawback that they rely on the truthfulness of the content provider, rather than on information about the nature of the content that is generated by Google's own algorithms. In other words, why rely on a third party to state the truth if you can just mine the truth, at the expense of some CPU cycles? Trying to beat the system thus at least involves algorithms and thus some effort, as opposed to none.
  2. When looking at the Google blog's content traffic, the use of image sprites to avoid HTTP round trips has been a ubiquitous asset in Google's web repertoire for 3+ years. Interestingly, some of the site's JavaScripts also pass along base64-encoded images. Moreover, Chrome can directly read such base64 strings with a recognized MIME-type attribute from the address bar, and a drag-and-drop operation will bring them back into binary form, eating up space on your hard drive (or... cloud drive?). You can try to copy and paste the content below into your address bar. It will show an animated loader GIF image. Update: Works in Opera >10.5 and Firefox as well (not all MIME types).
    data:image/gif;base64,R0lGODlhIAAIAKECAEVojoSctMHN2QAAACH/C05FVFNDQVBFMi4wAw
    EAAAAh+QQFCgADACwAAAAAIAAIAAACFZyPqcvtD6KMr445LcRUN9554kiSBQAh+QQFCgADACwCAA
    IAEgAEAAACD4xvM8DNiJRz8Mj5ari4AAAh+QQFCgADACwCAAIAHAAEAAACGJRvM8HNCqKMCCnn4
    JT1XPwMG9cJH6iNBQAh+QQFCgADACwMAAIAEgAEAAACD5RvM8HNiJRz8Mj5qri4AAAh+QQFCgAD
    ACwWAAIACAAEAAACBZSPqYsFACH5BAUUAAMALAAAAAAgAAgAAAIOnI+py+0Po5y02ouzPgUAOw==
  3. The cat is out of the bag: Google is indeed heavily invested in graph databases and process graphs (Pregel: a system for large-scale graph processing, Greg Malewicz, Google, Inc.; Matthew Austern, Google, Inc.; Aart Bik, Google, Inc.; James Dehnert, Google, Inc.; Ilan Horn, Google, Inc.; Naty Leiser, Google, Inc.; Grzegorz Czajkowski, Google, Inc., SIGMOD 2010, Published by ACM 2010). Owing to their algorithmic nature, graph databases are well amenable to parallelization, particularly via the MapReduce computing paradigm (Parallel Data Processing with MapReduce: A Survey - SIGMOD 2011 by KH Lee - PDF).
  4. Finally, it is becoming clear that Google cannot afford to make any major product without offerings towards the/'its' mobile market. With Android now having a significant lead over iOS, that strategy is a no-brainer. Now I am gradually realizing it as well, manifested in recent dabblings in Phonegap
    Update: The mobile connection of the knowledge-graph has been more or less affirmed through this official blog post.
    Indeed targeting the mobile market is most sensible, due to the inherent limitations of the input device imposed by mobile form-factors. 'Power-googlers' (i.e. users that spawn a new tab every minute and enter a new search every ten seconds) are likely to remain unimpressed by the new additions. 
  5. (Google is increasingly no longer content with half of the search-site remaining blank? I thought that 'is Google'. Is the increasing screen-real estate at fault or is the lack of competitors leading to general discontent ;) ?)

Tuesday, May 15, 2012

Web development and Javascript tricks and tips - a WIP compilation

I started putting together a list of neat tricks and tips, which haven't been frequently blogged about and which I picked up as I went along the way of web-development. Eventually the list should feature around ten entries.

So without further ado...
1) Web Development Tips and Tricks

2) Adding Content Before or After an HTML Element via CSS pseudoelement :before , :after
3) CSS Most important text overflow and wrapping options
4) A list of common HTTP header settings
5) JSON.stringify: Forcing Array notation for JavaScript's serialized function arguments

Thursday, May 3, 2012

RegEx Text Import w. MySQL Query Browser - ad ETL methods

Introduction:
MySQL Query Browser is a no longer supported cross-platform SQL Toolkit based on TrollTech's QT-Platform. Significantly it allows many automated or semi-automated routines for creating, running and editing various MySQL resources such as Views, Functions, Procedures, Triggers among others and allows syntax-highlighted editing of multi-line SQL statements. An undocumented feature is RegEx-Text Import, which is one of several useful additions to performing ETL (Extract Transform Load) tasks of naive (or slightly pre-processed) data.


The undocumented RegEx Text Import Feature of the MySQL Query Browser for running simple ETL tasks.
Doing a quick google search turned up a range of user comments such as 'never-fully implemented', 'non-functional' to 'being a future implementation', which prompted me to write a short commentary/post.

Like most aspects within the Query Browser UI, the RegEx Text Import assumes drag-and-drop user interaction. RegEx patterns adhere to MySQL string conventions, which makes it necessary to escape RegEx syntax with triple backslashes. For instance:

This simple SQL query searches for compounds with an α-amino-acid function:
SELECT DISTINCT a.Compound_common_name FROM aracyc_compounds a WHERE a.Smiles REGEXP '(C\\\(.?N.{0,2}\\\)\\\(?C\\\(=O\\\).?O.{0,2}\\\)?)';

Methods:

The first order of business is thus loading the source file containing the naive data and adding a RegEx, which is evaluated on the fly (a green text underneath the RegEx textbox literally gives you the green light). You can further subdivide your RegEx by right-clicking on the match and selecting 'Create Nested Expression'. At any time you can remove a RegEx by right-clicking on it and selecting 'Remove RegEx'.
Underneath the source-data textfield is a textfield containing INSERT INTO.., serving as the text-transform template. Simply drag and drop your desired match into the corresponding field (e.g. $1, $2) and, in case the match is a string, double-quote it accordingly - unless your string contains double quotes, in which case single quotes or wrapping the entire expression in a MySQL string escape function will do the job.

Notably, the RegEx match variables behave much like those of most RegEx implementations, as demonstrated in the following JavaScript example:

var src = "00010\tGlycolysis / Gluconeogenesis\n" +
          "00020\tCitrate cycle (TCA cycle)\n" +
          "00030\tPentose phosphate pathway\n";

// heed the multiline flag!
src.match(/(\d{5})\t(.*?)$/im)
["00010\tGlycolysis / Gluconeogenesis", "00010", "Glycolysis / Gluconeogenesis"]

RegExp.$1
"00010"
RegExp.$2
"Glycolysis / Gluconeogenesis"

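The transform template of the RegEx Text Import, with its $1 and $2 fields, can be mimicked in JavaScript for illustration. The table name pathways and its columns id and name are hypothetical, and the sketch does not escape double quotes inside the matched values.

```javascript
// Turn one tab-separated source line into an INSERT statement,
// analogous to dropping $1 and $2 into the INSERT INTO template.
function toInsert(line) {
  var m = line.match(/^(\d{5})\t(.*)$/);
  if (!m) return null;   // the line does not fit the pattern
  return 'INSERT INTO pathways (id, name) VALUES ("' +
         m[1] + '", "' + m[2] + '");';
}
```

Applied to each line of the source text, this yields one statement per pathway; lines that do not match are skipped by checking for null.
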
You can find more ETL related resources and script on my gist account. Be advised that this is but one of many ETL methods I employ, depending on the scenario and the nature of the data. I intend to provide a comparison in a future post.