Monday, May 13, 2013

MySQL: fuzzy searching fulltext queries

Introduction:
Whenever a user-search on a constrained set of rows comes up empty, it is desirable, from a user-experience point-of-view, to re-perform the search in a fuzzy manner. That is, allowing the search-term to match with a certain error and bias. After all, users have grown accustomed to technology that assists rather than hinders.
Searching for the German word Kompetenz, instantly returns a table-result.
Performing a google-search for "mysql fuzzy search", brought up this stackoverflow question: How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete? Among the results were also similar, yet unfinished attempts.
Frequently pointed out is the  Levenshtein distance, which in combination with a certain cutoff score, provides a great way of fuzzy-matching strings. Unfortunately, MySQL >5.x does not provide a Levenshtein function. There is the option of stored procedures, even compiling libs and binding them at runtime. Procedures will require the database-user to have the necessary GRANT PRIVILEGES ON USER, and on cheap or free LAMP (Linux Apache MySql PHP) hosts, this is generally not never the case.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.


MySQL does however provide a SOUNDEX function natively, which can return a result not limited in length, based upon the original Soundex implementation developed by Robert C. Russell and Margaret K. Odell - and patented in 1918, almost a century ago. The original implementation discards vowels first and duplicates second. It is generally the other way around in subsequent implementations. Examples for MySQL SOUNDEX are provided below.

The Soundex algorithm in brief:

  • Retain the first letter of a word
  • Replace similar sounding sets of consonants through digits. sets are numbered between 1 and 6 e.g. {b, f, p, v} := 1
  • Include English language idiosyncrasies: e.g. h, w between the same digits, retains the first digit only 
  • Stops after a maximum of three digits (This is not the case in MySQL Soundex. but true for PHP Soundex)

Implementation



Description:

SUBSTRING is used to discard the first letter, thus constraining the search to numbers only. Using numbers only can be used to one's advantage when speed is of the essence, by storing a NUMERIC field in a temporary table or as an extra field in the data-table, which only contains SOUNDEXed digits of the original text-field, that is to be searched. MySQL may then optimize the search.

CONCAT  is used to concatenate strings, as MySQL lacks a string concatenation operator. The output of CONCAT serves as the input for LIKE function. The search term, which was reduced to a SOUNDEX number, is attempted to be matched anywhere within the SOUNDEX-reduced text field.
As such the algorithm is fairly easy and does have some quirks: Unlike natural language, which is separated by spaces into words and thus serve as a reference frame, the same does not hold true in this scenario. In the algorithm described all spaces are disregarded. Additionally SOUNDEX is optimized for the English language, meaning that consonants that sound similar in English are grouped and assigned a number and that Unicode language-alphabets are generally not supported.

Nonetheless, SOUNDEX is fast, and the algorithm works fairly well for typical scenarios such as small tables in Customer Relationship Management and Content Management Systems. An example can be seen in the screenshot, showing a result being returned in English despite the similar sounding search term being in German.


Notes:

  • SOUNDEX results often have a character size of four, since everything beyond three digits is cut off.
  • MySQL 5 brings with it a Fuzzy searching function called MATCH AGAINST. The function allows weights to be assigned to words, and summarized scores to be ultimately ordered by relevance of a given Fulltext field entry. A FULLTEXT index is advisable, which rules out using MATCH AGAINST out for large, monolithic databases.
  • The presented scheme through SOUNDEX and LIKE is already Map-Reduce-able.


Thursday, May 9, 2013

The best JavaScript documentation generator tools and reference builders

Never before has JavaScript seen more demand or attention, by both developers and users. JavaScript, besides Lua, are the only two popular prototypal programming languages around, bringing with it, their own set of peculiarities.

Quality code, that is maintainable in a team and can be well tested, is a crucial advantage for larger projects. It goes without saying, and few would argue, that a minimum set of qualitative documentation goes hand in hand with the maintenance of a vast codebase. Unlike its early days, JavaScript now has a great number of established documentation-creation and publishing tools. Find them below, and please share what you are currently using.

comprehensive; conventional style; lean and orthodox docs




lean, mean and clean; fast with node.js; little to no learning curve; several stylish templates; responsive designs; permalinks




comprehensive, powerful; vast codebase; rich templates and designs; HTML5; permalinks




common syntax; comprehensive Design, HTML5; fast; moderate learning curve




common syntax; simple; fast; conservative design


common syntax; comprehensive designs and templates; moderate building time; permalinks; not well known?



common syntax; comprehensive designs and features; HTML5; moderate building time; permalinks;


Others:


Notes:

The article was inspired by this compilation post at google-plus, which is really just a straight summary taken from the stackoverflow community, to which all the credit goes.

If you are interested in my own generator-preference for small projects: it is docco. You can find an example documentation-output of the current docco-version here.

Tags:

JavaScript JS documentation generator tools 2013 HTML5 docs maker collaboration

Saturday, May 4, 2013

The best online regular expression / RegEx testers and builder-tools as of 2013

High level programming without the power of regular expressions would be much less fun. They may often be crucial, when drafting a first program on a tight schedule.

Regular expressions are thus likely Donald Knuth's best friend, whose famous quote is:
premature optimization is the root of all evil

The top choices for building regular expressions are listed first.













  • http://www.regexplanet.com/simple/index.html - fast, versatile, bootstrap based - responsive design, works on mobiles, many options. multiple input-string testing. multiple language support:  Java, PHP, Ruby, .Net and python






 




Others:





Notice: Adobe Flash based options were omitted from the list. XnView was used for screen-captures.

Saturday, April 27, 2013

JavaScript: Array and Matrix Initialization, Sorting Dictionaries and Objects - included to GIST.JS

Two new functions made it into gist-lib, the successor of the technical JavaScript library is-lib, which I created during my stay at the excellent Systems Biology Group MoSys. The purpose of the library is the provision of a minimal set of functions, commonly used during the creation of technical web-applications. Technical being the emphasis. For instance, the client-based loading of remote files into a web-application, and subsequent value-filtering. There are some choices provided by web-framework libraries or PHP.JS, but these quickly bulk up the project's size. Underscore.js is a recent alternative that follows a similar idea.

1st Function: Dictionary / Object sorting in JavaScript:

Nested Objects are JavaScript's go-to variant for dictionary-implementations. Unfortunately, JavaScript has no comfortable, PHP-like way of sorting nested objects. See the StackOverflow Question: Sort a dictionary (or whatever key-value data structure in js) on word_number keys efficiently


One possible implementation is provided below, and will be followed up with a usage example.


2nd Function: Initializing Arrays and Matrices in JavaScript:

JavaScript does not allow straightforward array initialization, but in fact comes with its share of pitfalls. The prototypical example being: [Array(0), Array('0')]  . In this instance, the first array would have a length of 0, whilst the second one would have a length of 1. At that point, no explanation may be necessary anymore other than pointing out that the number supplied to Array.prototype.constructor(/*Int*/ i) sets the length attribute of the array, and allows the JIT compiler to optimize memory allocation on the current stack, yet the elements remain uninitialized. That may not be a bad idea, given that JavaScript only has two base values: undefined and null.

To initialize arrays the following function is useful:


GIST.JS will be released by the end of May.

Sunday, April 21, 2013

Google Chrome internal special pages for Developers : 2013

With the increasing popularity of the webkit browser engine and Google Chrome Browser, a lot of web-developers are primarily working with the toolset Chrome provides. There is a lot more to Chrome than meets the eye, as it turns out:

Google Chrome's global webkit profiler, accessible through chrome://profiler/


Summarized below is an up to date list with the most important internal Chrome pages, which I compiled when I attempted to dig into chrome-crash data. At this point some readers may be interested in the Google Chrome crash recovery project. You can conventiently crash any page at any given state, through careful application of conditional debugger statements, by entering chrome://crash. This is useful when hardening complex web-applications and/or trying to locate obscure webkit-related problems.

Sunday, April 7, 2013

Web browser usage share between 2011-2013: two thirds are using Firefox, Chrome and Safari. Is Android being under-reported?

Introduction

Running a technical blog is interesting as the blog's constituency, so to speak, prefers fast, feature-rich and standard-conforming web-browsers. In principle, since the days of HTML5, browser-quirks arising due to "W3C standards-ignoring-browser-vendors", shouldn't be an issue anymore. But as recently seen with the do-not-track (DNT) draft, at least one browser vendor ignored the recommendations anyways. Suffice to say, standards go only as far as its adherence and adoption.

A key question for web-developers are the presently predominant sandboxes/platforms on which their applications will be deployed. AI-class[1],led by Sebastian Thrun and Peter Norvig, which comprised a highly technical audience,  reported[2] usage statistics of 41% usage-share for Google Chrome [webkit], 33% usage-share for Mozilla Firefox [gecko], 10% usage-share for Apple Safari [webkit], and finally a usage-share of 9% for Microsoft Internet Explorer [trident?]. This is surprising since Microsoft Internet Explorer is often reported by Alexa's best-ranking web-sites, to hold its grip around 30% market share, when  no visitor preference is discernible.
Note: This paragraph dates back to 2011. Bracketed are the respective browser engines.

Is the Android usage-share under-reported?

Since 2011, and prior, Android has seen tremendous growth:
Android growth in device activations, Source: Google
Yet Android usage-share is still reported by Google-Analytics to be at less than 1% for this blog. Personally I use Opera Mobile when I am mobile, in part due to its compression and fast renderer. Likewise many Android users have strayed away from the built-in Android browser and exclusively use third-party browsers like Dolphin. The high third-party browser usage-share on Android could partially account for  Opera's over-representation in Google Analytics, which empirically should be much lower, - lingering at around 3% usage-share, when only traditional desktop visitors are c.
Google Analytics, reports Android as the Android-Browser, which itself has lost much user-interest since its early days to other browsers, to large extend Google's own multiplatform Browser Chrome. Many of the third-party web-browsers can emulate the "user agent strings" of Desktop web-browsers to force the reception of desktop website content. 

Regardless, a clear-cut verdict or analysis has yet to come. Shown below in Tab 1.0 is the shift in usage-share of web-browsers towards webkit-browsers and Firefox, which now own a market-share of roughly two thirds.


Browsers [%]Systems [%]
Nov 2011
Apr 2013Image displaying most popular browsersImage displaying most popular platforms
Tab 1.0 Operating System and Browser Usage-share of Blog visitors, over a duration of approx. 1,5 years

Gather statistics

Part of the incentive for writing this post is to crowd-source different statistics reported by other users and bloggers, with the aim to derive a consensus figure of Android usage share on technical sites. That of course requires that users share their site-statistics more frequently.




See also

Python: Ripcage - multithreaded download of web resources

ripcage is a python script, which is clean and easy to adapt. It serves as an demonstration of using the multithreaded workerpool module to download multiple resources in parallel - in principle through any protocol and authentication supported by your python libraries/modules such as TCP/IP, UDP, OAuth, TSL, SFTP, FTP,.. The initial version also entered meta information of each download into a MySql database, which is stripped for conciseness.  You can invoke any command line tool / script to post-process each resource once fetched.

Given that Python is fast and efficient to use when it comes to protocol usage and implementation, there is no limit of what you may fetch. Fork and adapt it from here.