Jude O'Kelly – Page 3 – Just another WordPress site

WebSite Performance Optimisation – Core Concepts

When it comes to performance tuning a site, there are a multitude of possible optimisations so I thought it best to distill these down to several core concepts.

Central to most of these concepts is an oveview of how a web page is loaded in a user’s browser. The below ‘waterfall’ diagram shows the loading process for the page. The bars represent the total time to load each page asset, the first request is for the page and then subsequent separate requests are made for each asset referenced page (such as image file, css etc, javascript file etc).

Note that in this article I will focus solely on front-end optimisations since for most web-pages server processing typically only accounts for between 10% – 15% of the total page load.

Distribute Requests
Older browsers were only permitted to open a maximum of two simultaneous connections to a domain. Thus if the page contained references to numerous assets (such as images, css and javascript files) these would be queued and loaded two at a time. Thus a quick optimization was to distribute the assets on subdomains (eg images could be on img.mydomain.com) which would allow for more files to be loaded simultaneously.
Modern browsers allow for more files to be loaded concurrently (Chrome for example allows for six), but this is still a very powerful optimization although you should now use a Content Delivery Network (CDN) to host static files. A CDN is a globally disbtributed network of servers which caches static files and serves these to a user from the closest physical server (or ‘edge location’). Using a CDN has the benefit of reducing the network latency in loading static files, since the site visitor will be served the static files (such a page images) from a server close to them.

One caveat in using a CDN is it can be problematic when working with CSS and Javascript files which may need to be updated. A CDN caches files at its various edge locations around the world and even after a file is updated, the old cached file may still be served until the TTL (Time to Live) expires which can often be days or even weeks. Thus site visitors may be served out of date CSS and Javascript files after a site update. One remedy to this issue is to version the files (eg adopting a naming convention such as ‘myjsfile1_001.js’) and so create a fresh file name for each update for which no cache exists.

Reduce The Number of Requests
In the belowexcerpt of the waterfall diagram of the page load, the yellow portion of each bar is the time taken to to download the file, the blue portion at the start is the time to open the connection to the server. Note that for the bottom three files the largest portion of the load time is opening the connection.

The time to open and close each connection is unrelated to the file size and so the time to open a connection to download a large file is the same as for a small file. Thus it would save a lot of load time if numerous small files where combined into large files. The most obvious candidates for this are CSS and Javascript files. When working with templates and frameworks dozens of these files are often requested, each of which requires separate load-time overhead.
Combining these files is often problematic in development since working with a single huge javascript would be very inefficient, however there are numerous solutions for combining these files upon deployment. ASP.NET MVC 4 ships with inbuilt bundling or other open source solutions such as SquishIt could also be used.

Reduce The Size of Files
Reducing the size of the files served to the user is an obvious and necessary step in performance optimisation. The first step is to look at the html itself – ensure there are no inline styles in the html, these are not only inefficient for development purposes but they impair performance since styles in external stylesheets are cached by the user’s browser and so do not need to be loaded on each page request.

Images files should be in an appropriate format. In general jpg files are larger and should only be used for pictures or graphics which make heavy use of gradients. A quality setting of above 80 is almost always overkill (although this may change with retina displays), typically a setting of jpg quality setting of 60-70 is the sweet spot for the quality/size trade-off. Simpler graphics such as logos or screenshots should be either gif or png formats, png is certainly the preferred format now for images of any complexity since it offers very good image quality (the screenshots in this article are in png format). The simplest page elements such as arrows, pointers, lines etc should normally be gif since this format is capable of the smallest image file sizes (note that these elements can be combined into a single larger image using CSS sprites ).

Static files such as CSS and Javascript files benefit from minification (which is typically removing whitespace and replacing long variable names with shorter ones). There are several open-source minifiers such as the YUI compressor.

Perceived Performance

Web pages load sequentially so placing large Javascript files at the top of the page can block the html below it from rendering until the file is fully loaded. Placing Javascript file at the bottom of the page does not affect the total page load time but allows the Html content to be shown to the user and then the Javascript file to be downloaded. Note that this technique should not be used for CSS files which are generally integral to the page being displayed.

A Simple Visual Random Number Test

Random numbers are very difficult to test since there are so many ways a series of numbers can exhibit non-random properties. Correlation between consecutive number in a random sequence and an uneven distribution of numbers across the entire sequence are just two important tests.

I was writing a simple random routine in C# and made the mistake of placing the declaration of the Random object inside the loop:

private ListRandNumbers()
{
var numbers = new List();

for (int i = 0; i < 100000; i++)
{
var rnd = new Random();
numbers.Add(rnd.Next(1, 1000));
}

return numbers;
}

In .NET this will produce a very poor series of random numbers since by default the Random object is automatically seeded with a DateTime and so if it is placed within a loop it will be created and seeded multiple times with identical values. Developers normally write tests to check an application’s data quality , but writing a battery of statistical tests to check a simple random number routine is working correctly is overkill.

A neat solution to this is to use a simple visual inspector with plots the values on a two dimensional surface. The below ASP.NET MVC controller would output the a bitmap of the above random number routine.

public ActionResult RandomNumberImage()
{
var path = @"c:usersyourfilepathfilename.bmp";
Bitmap bm = new Bitmap(1000, 1000);
var randNumbers = RandNumbers();
var randNumberCount = randNumbers.Count();

for (int i = 0; i {
bm.SetPixel(randNumbers[i], randNumbers[i+1], Color.Black);
}

bm.Save(path, System.Drawing.Imaging.ImageFormat.Bmp);
return File(path, "image/bmp");
}

This outputs the below visualisation:

Definitely not a random data set! Now we can correct the mistake and instantiate the Random object outside the loop:

private ListRandNumbers()
{
var numbers = new List();
var rnd = new Random();

for (int i = 0; i < 100000; i++)
{
numbers.Add(rnd.Next(1, 1000));
}
return numbers;
}

Now lets look at the output of the correctly implemented routine:

This appears much better. In addition to a broader scatter, note that there appears to be more data points, this is because in the previous routine numbers were often identical and hence plotted on top of eachother.

Visualisation can also help surface more subtle issues in random number generation. The below image is from a data set generated using the PHP random number generator on a Windows machine:

Source : random.org

The relationships between these numbers may be quite difficult to discern using statistical tests but for the above image there are clearly some relationships within the data set and hence it would not be random.

Note that this is not a robust statistical test, the inbuilt random number generation in .NET is generally considered to perform poorly (as is the case with Excel random numbers)

Testing ASP.NET MVC Server Side Validation

I came across an online random number generator recently which provided an input for the number of random numbers to be generated. Since I was looking for large samples I entered 1000000 to see what the response would be. Since such a request would likely cause too much server load, the site developer had protected against this and the site popped up the message:

My suspicions were aroused as to how completely this had been implemented. I fired up Fiddler, and modified the Http Request to set the random number to 8500. The site then happily returned the response with 8500 random numbers. Clearly the developers had only implemented this as a Javascript test on the browser and the server itself would accept and process any request. An attack could easily be mounted by sending off several simultaneous requests for a massive number of rows which would likely overload the server and bring the site down. In reality, modifying the Http post request is not even required as disabling Javascript on the browser would circumvent the number test.

Framework Protection

Frameworks such as ASP.NET MVC and Ruby on Rails provide client side and server side testing out-of-the-box. In ASP.NET MVC for example, adding a the [Range] data annotation in the model will result in both server side and client side tests being implemented:

public class RandonNumberViewModel    
{        
[Range(1, 2500)] //Only numbers between 1 and 2500 will be accepted 
public int RandomNumbers{ get; set; } 
}

Now when the developer uses this model to generated a strongly typed page it will automatically implement a Javascript test, in addition the model returned by the Http Post will be marked as invalid on the server (for full details on how to implement this see ASP.NET MVC Model Validation ).

When model validation is correctly implemented it provides excellent protection from Http Post modification attacks. However, small code mistakes such as omitting a check of ModelState.IsValid prior to process a request will leave the application open to attack. Thus it is essential that developers do not blindly trust the framework to fully protect a site from attack, and test the site’s security by simulating an attack prior to going into production.

Server Side Validation Testing With Http Request Modification

The first step is to install an Http debugging proxy, the most popular of these on Windows is Fiddler which is free and well supported although the interface can be a little difficult to navigate, a good alternative is Charles Proxy which is a paid product but easier to use. I will use Fiddler in this demo.

The default setup in Fiddler shows a list of Web Sessions in the left pane (see below). Typically each ‘web session’ is an Http Request and corresponding Http Response. Selecting a Web Session, and then clicking the Inspectors tab as below will load the Request in the top right pane and the Response in the bottom right pane.

Fiddler provides a wealth of information on the Http traffic but for this demo we are interested in modifying and sending a Post request. To do this first send a post request from your browser by submitting a form, this will then be captured in the Web Sessions pane of Fiddler. Then select the Composer tab on the right of the screen. For this demo I will use the default ASP.NET MVC site that is included in new Visual Studio ASP.NET MVC Internet projects. I will attempt to circumvent the restriction for passwords to be a minimum of six characters.

First I will attempt to register an account with a short password, this yields the below result when the register form is submitted from the website.

This is the client side Javascript preventing me from entering a short password. Now that I know the Javascript prevents an incorrect form submission I will send off a correct form submission with a valid password. This succeeds and I now have a post submission in my Web Sessions listing:

Now, select the Composer tab and then drag the Web Session to the right hand pane (this is just a short cut for copying the Http Post so we can edit it). Now in the Request Body pane simply modify the Password and Confirm password values to invalid values (ie ‘judo’) as below and click Execute.

This will manually send an Http Post request and there will be a Web Session for the request and response. Select the session and the lower pane will show the response. The WebView tab will show a crude rendering of the Html in the response :

As can be seen the server tested the password and found that it was of insufficient length and did not register an account, and so both the client and server side validation are working correctly.

MVC Security – Model Binding Vulnerability

In my article on Parameter Tampering in ASP.NET MVC I focused on the general techniques of parameter tampering and defenses against it. In this article I will examine a specific type of parameter tampering which is often termed Mass Assignment.

In most MVC web development frameworks (including ASP.NET MVC). Model binding works by assigning an incoming html Form’s values to an object’s properties. This makes for very efficient development but one that contains a critical security flaw.

Take the below Employee model:

public class Employee
{

public int ID {get; set;}
public string FullName {get; set;}
public DateTime DateOfBirth {get; set;}
public bool IsAdmin {get; set;}

}

Note for now that the IsAdmin property is for determining if the Employee has admin rights on the application.

Assuming an employee creates a record themselves in the web app the relevant part of the html form will be as below (note that the id field is automatically set when the record is created in the database and so is not present and the IsAdmin field is not exposed).

<input id="fullname" name="fullname" type="text"  >
<input id="dateofbirth" name="dateofbirth" type="text"  >

When the form is submitted it will produce the below Http Post Request:

Request URL:http://yoursite/register
Request Method:POST
Status Code:200 OK
...
...
fullname:Jude OKelly
dateofbirth:22-04-1972

In an MVC framework like ASP.NET MVC an incoming Http Post/Get request is handled by a Controller’s method. In the scenario of a form being posted to the server, the MVC framework can automatically create an object and populate its properties with the form values:

[HttpPost]
public ActionResult Register(Employee emp)
{

//the emp object is automatically created and assigned the values from the html form:
db.Employees.Add(emp);
db.SaveChanges();

}

The above example uses an ORM such as Entity Framework to save the emp object to the database. So the database will contain the posted values for FullName and DateofBirth. The ID property is automatically assigned and since the IsAdmin value is not provided it is given a default value of False. Thus, in this case the newly created user will not have admin rights on the system.

However, using a tool such as Fiddler it is relatively simple to alter the html Post to include ‘IsAdmin:True’:

Request URL:http://yoursite/register
Request Method:POST
Status Code:200 OK
...
...
fullname:Jude OKelly
dateofbirth:22-04-1972
isadmin:true

Once the ASP.NET MVC model binder receives this http post request it will attempt to match as many properties of the incoming form with the Employee object, since the IsAdmin property is now provided it will be set and saved to the database. Thus the new user will now have full admin rights to the system.

Another way to inject ‘IsAdmin=true’ would be to simply add it to the querystring when posting back the page. ASP.NET MVC (as well as other MVC frameworks) will look for model values in the html form values, the querystring and route values, and cookies.

Solutions

There are several solutions to this vulnerability:

Matching Incoming Parameters

Instead of using a full object in the method’s constructor, just declare the parameters that are expected and manually bind them to the Employee object:

[HttpPost]
public ActionResult Register(string fullName, DateTime dateOfBirth)
{

var emp = new Employee{FullName = fullName, DateOfBirth=dateOfBirth};
db.Employees.Add(emp);
db.SaveChanges();

}

In this case that any additional data such as IsAdmin=True in the form post will be ignored and not make it into the database.

Use A ViewModel

A very common practise in MVC web development is to create custom models for each page (or ‘View’ in MVC parlance). These models can be a composite of different domain models or just a subset of properties of a single domain model.
In this case we would simply declare a ViewModel as below:

public class RegisterEmployeeViewModel
{

public string FullName {get; set;}
public DateTime DateOfBirth {get; set;} 
}

Note this ViewModel only contains the two properties of the Employee object we wish to receive from the html form.

Now in the Controller we will receive this model and use the MVC model binder to automatically create the RegisterEmployeeViewModel object and populate its properties with the values of the html form:


[HttpPost]
public ActionResult Register(RegisterEmployeeViewModel empViewModel)
{

var emp = new Employee{FullName = empViewModel.FullName, 
                       DateOfBirth= empViewModel.DateOfBirth};
db.Employees.Add(emp);
db.SaveChanges();

}

In this case we created a new Employee object and populated is values with the ViewModel object’s properties, in practice various mapping solutions can do this automatically to simplify the process.

Whitelist / Blacklist Parameters To Bind

ASP.NET MVC provides the Bind attribute which can be used to either specifically include or exclude parameters for model binding.

Whitelist parameters to be included in model binding (note only the UserName will be used in binding) :

 

[HttpPost]
[Authorize]
ActionResult UserDetails([Bind(Include = "UserName")] User user)
{ 
	//Update Logic
}

Blacklist parameters to be excluded in model binding (all parameters except IsAdmin will be used in binding) :

 

[HttpPost]
[Authorize]
ActionResult UserDetails([Bind(Exclude = "IsAdmin")] User user)
{ 
	//Update Logic
}

This approach has the weakness that it requires the Bind attribute to be used on every controller action where a mass assignment model binding vulnerability may exist, which is a reliable approach – a new controller action could easily be added in an update which omits the crucial bind attribute. In general this defense should be thought of as an extra layer of security to be used in conjunction with other approaches.

Unfortunately, so many of the early demos of ASP.NET MVC featured this simplified model binding scenario that the security flaw is very common in production sites (or at least that is my experience).

Entity Framework Performance Optimization

Query Tuning

Minimise the Data Requested

Minimising the data queried is a fundamental tenet of performance tuning for any method of data access. In querying a database using SQL, the below query should almost never make it into production :

Select * From [Table]

Instead, the query should specify the fields to be returned from the table. This is extremely familiar to any database developer or DBA. However this principal is often ignored when developing with Entity Framework.

In a recent project I was working on an alerting system which would send an email alert when a company released an SEC filing. The Alert entity had a related Company entity:

public class Alert
{
public virtual int ID { get; set; } 
public virtual DateTime CreateDate { get; set; } 
public virtual Company Company { get; set; }
//......more properties
}
public class Company
{
[Key]
public virtual string Ticker { get; set; }
public virtual string Name { get; set; }
//......more properties
}

The UI needed to display the CreateDate of an Alert and the Name of the Company. How should this be written in LINQ-to-Entities?
Often, the related entity (Company in this case) would be requested using Include() as below:

var testQuery = db.Alerts   // db is the instance of the DbContext
.Where(x => x.ID == 1)
.Include(y => y.Company);

This is a common pattern, however, it is actually the equivalent of loading the entire rows in both the Company and Alert tables in the database. Look at the SQL this generates:

That is a lot of data requested when all that was needed was the CreateDate and Company Name. One way to solve this is by using an anonymous type which only selects the data requested :

var testQuery = db.Alerts.Where(x => x.ID == 1)
.Select(y => new {
CreateDate = y.CreateDate,
Company = y.Company.Name
});

This would generate the below SQL which is a vast improvement.

Anonymous types are not always that convenient since they cannot be used outside the scope of the method they are declared in. There are several alternatives such as creating public classes with only the properties which are required (these public classes would in some ways be analogous to Views in SQL Server since they would be a light-weight composite of the properties/fields of other entities).

Do Not Count After Query Execution

To perform a simple count, ensure that you include Count() in the LINQ statement (which will only generate a SQL COUNT query) and don’t return an IEnumberable and then execute a Count() on it which returns all the entities and then performs an in memory count. There are several varieties of this issue:

var empCount = db.Employees
.Where(x => x.CreateDate < DateTime.Now.AddDays(-10)
.Count() //only executes a SQL COUNT
var emps = db.Employees
.Where(x => x.CreateDate < DateTime.Now.AddDays(-10)
var empCount = emps.Count() // returns all matching employees and then executes a count

Filter Before ToList()

Another issue is executing ToList() prior to any filtering. ToList() will instantly trigger a database query for all LINQ statements prior to it being hit, subsequent LINQ filter statements such as Where() Take() Skip() etc will then be executed on the in-memory List:

//ToList() executed too early
var emps = db.Employees
.Where(x => x.CreateDate < DateTime.Now.AddDays(-10)
.ToList() // All employees created within last 10 days queried
.Take(50) // List is filtered for the top 50
//ToList() executed last
var emps = db.Employees
.Where(x => x.CreateDate < DateTime.Now.AddDays(-10)
.Take(50)
.ToList() //Query made for top 50 employees created within last 10 days

Don’t Load Entities To Delete Them

A common problem is loading an entity which needs to be deleted. The DbContext Remove() method requires an entity to be passed in, so if only provided with a primary key the below code would effect a delete on the database:

var emp = db.Employees.Find(empID); //Executes a query against the database
db.Employees.Remove(emp);
db.SaveChanges(); //Executes a second hit to the database

Instead of this, create a temporary entity and use Attach() to attach it to the context and then call Remove (it is not relevant that the entity will not have all its properties loaded – all that is needed for the Remove() method is the primary key):

var emp = new Employee { ID = empID }
db.Employees.Attach(emp);
db.Employees.Remove(emp);
db.SaveChanges(); //Executes a single hit to the database

Overall, the primary issue to bear in mind is that when an entity is requested all the properties on that entity are queried and returned unless the query states otherwise.

Profile Generated SQL

A problem in using Entity Framework (especially versions 4.1 and later) is that it encourages developers to focus solely on code and ignore the data access. This can result in an over-reliance on EF to handle all data access and SQL generation. It is often difficult to determine the efficiency of an operation simply by looking at the code, and so bloated queries due to poor LINQ structure as noted above or from EF’s SQL generation can arise. It is thus essential to look at the actual SQL generated by EF.

The simplest way to profile the SQL is to call the ToString() method on an IQueryable<> object. In the previous examples I generated the SQL by using:

var  sql = testQuery.ToString(); //Note this is only in EF 4.1 +

Beyond that, most SQL Server DBAs/developers should be familiar with SQL Profiler which would provide more indepth analytics.

Julie Lerman wrote a good roundup of EF profiling methods in Profiling Database Activity in EF (note that this was written prior to the release of EF 4.1 so the coding examples use the ObjectContext and not the DbContext which has some simpler methods such as calling ToString(0 directly on an IQueryable).

Working With Change Tracking

EF works in the background to track the changes of any in-memory entities. There are two methods by which it performs this function. Snapshot Change Tracking, keeps an in-memory snapshot of the entity when it originally sees it and compares it to the current values. Change Tracking Proxies by contrast generates a dynamic proxy for the entity at run-time which notifies EF of changes when they are made.

EF will use change tracking proxies if an entity meets all its criteria (the class must be public as well as neither sealed nor abstract, every navigation property must be marked ‘virtual’ plus several other criteria), otherwise the class will use snapshot change tracking.

The advantage of change tracking proxies is that they can automatically notify EF when a entity is updated and save the overhead of EF having to scan through numerous entities comparing them to their previous values. Thus if you have a large number of tracked entities and only a small number of entities whose values change then using change tracking proxies is likely to be very efficient.

Unfortunately, change tracking proxies can be extremely inefficient in scenarios when there are a large number of changes since each proxy has to notify EF of a change. Snapshot change tracking will only perform a single scan when DetectChanges() is triggered (this is automatically called by EF when an event such as SaveChanges() occurs).

In many scenarios the performance difference between the two change tracking methods may not be great but each should be considered when looking at performance optimisation. A scan through the questions of StackOverflow reveals that change tracking proxies does seem to cause quite a few performance and development issues as the entity model is much more restricted when using them, so I would typically default to using snapshot tracking.

When using snapshot change tracking note that change tracking can be temporarily disabled. This may be useful in performance tuning as DetectChanges() is called (and all entities in memory therefore scanned) if any LINQ query is executed on the DbContext object or when any of the below methods are executed :

DbSet.Add()
DbSet.Find()
DbSet.Remove()
DbSet.Local
DbSet.Attach()
DbContext.SaveChanges()
DbConext.GetValidationErrors()
DbContext.Entry()
DbChangeTracker.Entries()

In circumstances where there are a lot of entities being tracked in memory and when you are sure that the existing entities will not be changed it may make sense to test the performance improvement from briefly turning of change tracking:

db.Configuration.AutoDetectChangesEnabled = false;
db.Employees.Add(new Employee {name=”jude”});
db.Employees.SaveChanges();
db.Configuration.AutoDetectChangesEnabled = true;

The above code sample avoids an unnecessary call to DetectChanges for Add and SaveChanges, since we know neither of these will change the existing entities we can safely turn change tracking off while this work is performed.

Use Eager Loading Instead of Lazy Loading

Lazy loading (when EF automagically loads related entities when they are required) is convenient but leads to several performance issues. Most notable is when a related entity is called within a loop. When using lazy loading this can trigger a separate database query each time the related entity is requested :

var employees = db.Employees;
foreach(var emp in employees)
{
var address = emp.Address; //Database query is executed every time this line is hit
}

By contrast, the below code eagerly loads the related Address entity and only executes a single query on the database.

var employees = db.Employees .Include(x => x.Address) .ToList(); //All data loaded from database
foreach(var emp in employees)
{
var address = emp.Address; //No database query executed
}

Lazy loading also causes a variety of development headaches (as noted in Entity Framework Gotchas) and can be globally turned off in the constructor for the context:

public MyContext() : base() {     this.Configuration.LazyLoadingEnabled = false; }

Query In-Memory Entities First

EF keeps track of in-memory entities so retrieving an entity from memory is preferable to executing a database query. The easiest to implement example of this is to use the Find() instead of Where() when retrieving an entity by primary key. The Find() method first queries the in-memory entities and if no match is found it then executes a database query, whereas Where() will always hit the database:

var emp = db.Employees.Find(empID); // Queries in memory first
var emp = db.Employees.Where(x => x.ID == empID); //Always hits the database

It is not always convenient to do so but can manually query in-memory entities using Local if required :

var emp = db.Employees.Local.Where(x => x.ID == empID) ;

Don’t Rely on Entity Framework Exclusively For Data Access

Finally, EF is great for productivity but executing raw SQL or a stored procedure may often produce performance benefits.

In EF 4.0 and above the DbContext the SQLQuery() method to execute raw SQL:

var emps = db.Database.SQLQuery("select email from Employees");

Similarly use ExecuteSqlCommand() to execute a stored procedure:

var emps = db.Database.ExecuteSqlCommand("...");

Wrap Up

I have of course ignored the actual SQL Server database – which is a massive subject by itself, but these are the primary performance issues I encountered. Please do let me know if you think anything has been left out.

Correlation – The Need For ‘Stationary’ Data

How correlated are Intel and Google’s stock prices? The below graph shows the daily close of prices of both from mid 2009 to mid 2011.

GOOG INTL Stock Prices

From first impressions, it certainly looks like the two price series move in tandem and should have a high correlation. Indeed it turns out that correlation coefficient of the two stock price series is 0.88 , indicating a high 88% correlation between the Google and Intel stock prices.

However, this is totally misleading – in reality the correlation between the two is a mere 36%.

Correlation, in common with most time-series data analysis techniques requires ‘stationary’ data as an input. To be stationary the data must have a constant variance over time and be mean reverting. Stock price data (and many other economic data series) exhibit trending patterns which violates the criteria of stationarity. Transformation to stationary data is quite simple, however, as converting the daily price closes into daily returns will normally be sufficient. The return series of a stock is usually considered as stationary for time series analysis purposes, since it is mean-reverting (as the daily returns oscillate above and below and constant mean) and has a constant variance (the magnitude of the returns above and below the mean will be relatively constant over time despite numerous spikes).

The daily series of returns (ie percentage price changes) for both Google and Intel stocks can be seen below. Not that there is no trend to the series which moves above and below a constant mean – which for daily stock price returns is almost always very close to 0%.

GOOG INTL Stationary Series

The requirement for stationary data in calculating correlation can also be explained intuitively. Imagine you were looking to hedge a long position in Google stock with a short position in Intel, you would want the return on the Google stock to the match the return on the Intel stock. Hence correlating the prices would be irrelevant, in such a scenario you would want to know the correlation between the two sets of returns as this is what you would essentially be attempting to match with the hedge.

Correcting For Drift And Seasonality

In correlating stock price data, transforming the raw price data to returns is usually considered sufficient, however , to be more rigorous any additional trends could be stripped out of the data. Most models of stock price behaviour include the risk free interest rate plus a required rate of return as a constant drift over time – the argument being that stock investors require this return for holding the stock and over the long term the stock should deliver that return. Thus, the this return could be backed out of the series before calculating correlation. In practice, since we are dealing with daily returns, the long term drift as a minimal impact on the calculation of correlation.

Some economic data series such as durable goods orders exhibit strong effects of seasonality. When raw durable goods orders data is transformed into percentage changes, it is indeed mean reverting with a constant mean. However, the series will still not be stationary due to the strong seasonality effects – orders will be much much higher during the Christmas shopping season and so the percentage changes will always spike at the time resulting in a non constant variance.

Seasonality can be dealt with by cleaning the data series using another series which exhibits the same seasonality. In the case of durable goods orders, the raw CPI index (note: not the percentage change in CPI) would be such as series since the CPI index will typically spike during shopping seasons. Thus the durable goods orders could be divided by the CPI to arrive at a ‘deflated’ durable goods series which could then be made stationary by transforming it into percentage changes between periods.

The Myth of Card Counting

There are lots of articles info on how to count cards but v little on the returns. Ever wonder why there is so little, or why casinos are relatively unconcerned with card counting – there are card counting books in casino gift shops and a few innocuous rule changes would probably wipe it out altogether. The underlying reason is that far from being a ticket to wealth, the returns from card counting are atrocious.

Background

First why does card counting work at all? In Blackjack the dealer has the advantage of collecting a player’s bet whenever the player busts regardless of the dealers outcome. The player has two main advantages, a payout of 3/2 for Blackjack (an Ace / Ten pair) and the ability to stand on any card combination whereas the dealer must hit up to 17.

The result of this is that a deck with a heavy concentration of high value cards greatly favours the player. High cards increase the likelihood of busts, and a player can avoid these by standing on low values. For example, a player with a card combination equaling 12 facing a dealer’s 6 show card should stand as the next card is likely to bust the player, the dealer however, will have a high likelihood of busting as he will be required to take at least two cards from a deck loaded with high value cards. Thus the effect of high cards is more to bust the dealer than win the player hands. High cars also increase the chances of blackjack for the player.

It would therefore be advantageous to know when there is a high proportion of high value cards in a deck so a player can increase bets and stand on lower values.

Basic Strategy

The starting point for any blackjack counting system is basic strategy – a set of rules determining when to hit, stand, split and double-down based on the player’s cards and the dealer’s show card. This can be represented as a grid or listing of rules. This shouldn’t be a daunting task although you need to be almost flawless at this – only one error per twenty shoes is permissible.

Counting

Next is learning a counting system. Hi-Lo is the most popular system in which a value of one is added for a cards of value 6 and under, one is deducted for 10 value cards and aces, and 7,8,9 are ignored. Thus for a sequence of Jack, 8, 3, Ace , the count would be -1 (ie -1, 0, +1, -1).

This appears extremely simple but requires a great amount of effort since it needs to be executed almost flawlessly (only one or two counting errors are permitted over a six deck shoe). This typically takes several hours per day for two to three months.

This count (‘running count’) only gives the excess number of high cards over low cards, however what we need is the proportion of high cards relative the to the remaining cards. So the ‘running count’ needs to be divided by the number of remaining decks in the shoe to arrive at the ‘true count’. To do this the counter also needs to keep a count of the number of cards played (a rough approximation is usually sufficient).

Once you have mastered card counting, you need to learn the modifications to basic strategy. Since the strategy will be very different in situations where the true count is high. That’s a solid three to six months work, spending of several hours per day.

Calculating Returns

What advantage does all this effort give a card counter over the dealer? A shade over 1%. Hardly juicy, but lets work through the returns.

We will look at this as an investment and so work backwards from the bankroll. Say you have $100,000 to invest in the bankroll. Your betting unit (i.e. the amount your bet is increased for every +0.5 in the true count) should be $200 – this isn’t an exact number but to avoid the fatal blow of wiping out the betting unit 0.1% to 0.5% of the bankroll, in this example I went with 0.2%.

With perfect basic strategy and perfect card counting the expected returns will be the betting advantage multiplied by the betting unit. Thus, in this case the expected return on a hand would be 0.01 x $200 = $2.

Assuming you can play at a rate of 50 hands per hour that gives you $100 per per hour. Its possible to play more hands per hour, especially playing one-on-one with the dealer, but this increases the chance of being detected and also leads to counting errors due to the speed of play.

Next, to set up a confidence interval to estimate the distribution of returns over time. The standard deviation of a bet in blackjack is 1.1 (slightly larger than the bet size due to the increased payout in the event of blackjack). Assuming a normal distribution we can therefore say that the earnings will be the expected return (ie mean) plus or minus three standard deviations with 99.7% certainty.

The real issue for blackjack card counters is that randomness dominates until you play a very large number of hands.

Take the scenario after 100 bets :
The expected return would be $2 x 100 = 200. The standard error increases with the square root of trials therefore the standard error after 100 trials is 10 x (1.1 x 200) = 2200. Thus we can say with 99.7% certainty that after 100 hands of blackjack the return should be $200 plus or minus $6600 (or between -$6400 and +$6800).

Even 100,000 hands doesn’t provide a guaranteed return. The expected return after 100,000 hands is $200,000 plus or minus $208,560 (SQRT(100,000) x 220 x 3).

You really need to be approaching half a million hands of blackjack to be deep into positive returns. After 500,000 hands you would be 99.7% sure of having a return of $1,000,000 plus or minus $466,620. Alas 500,000 hands would take about 5000 hours or 625 days of playing 8 hours per day.

Now To The The Real World

Unfortunately the returns only get worse once you start adjusting for the real world. Card counting is actually very obvious, even an inexperienced dealer can quickly identify a counter. The reason is that the betting profile of a counter is totally different to any other player. Dealers will all have a working knowledge of basic strategy and will know a player playing decent basic strategy – which a card counter will do.

The problem is that a card counter will suddenly deviate from playing perfect basic strategy with a low bet to a high bet with major deviations from basic strategy. Some situations in particular are a major tell. Never splitting tens is not just basic strategy but also common sense, however, if when the true count and the dealer show card is a 5 or 6 then a counter must split tens.

The best way to avoid detection is to play as a team with one player as the counter who plays basic strategy but no more. Once the count is high the counter will signal the high-roller who plays a larger bet and does not follow basic strategy. This, however, vastly dilutes returns since the 1% advantage is divided between two players and one player will be playing at approximately a 0.5% disadvantage (although on a lower bet).

Given the large number of hands required and the necessity of playing with more than one player, card counters usually play in large teams. However, the meager 1% advantage over the house means there is very little return to be shared around.

Hence, most card counting teams have a hierarchy – investors/team leaders, senior player and junior players. With the players being paid a very small return for their work.

In practice the real trick of card counting is finding smart, energetic workers willing to put in six months of training (probably unpaid) and then work back-breaking shifts of 8-12 hours a day at the tables for a shade above minimum wage.

Five Reasons Not To Be On The Cloud

Hosting applications on the ‘Cloud’ was often viewed as a no-brainer, however after a mini counter revolution of companies moving applications off the cloud I thought I would compile a listing of the primary reasons to be cautious before moving your application to the cloud.

Performance

All major cloud providers use generic hardware – that is low-spec IDE drives, slow memory and old processors. As well as the low-spec servers you will be sharing system resources with other customers and virtualisation canot fully isolate users, hard drives in particular cannot be dedicated to a single user.

The result of all this is a sluggish performance. Performance issues can of course be addressed by applying several techniques such as utilising Content Delivery Networks, employing a variety of caching techniques and aggressive front-end optimisation, but the issues remains that out-of-the-box a cloud setup will almost certainly be less responsive and performant than a dedicated setup.

Another persistant issue is variable performance. Cloud customers repeatedly report performance ‘black holes’ when system response times sudden dip for several hours for no apparent reason. The cause of this is likely a ‘noisy neighbour’ (ie another cloud user hogging resources before being shutdown) or system maintenance/updates, but it remains a constant frustration as it is impossible to predict when performance will suddenly drop.

Cloud providers usually quote the CPU in standard units which are not comparable to the cores on CPUs of modern servers. Amazon Web Services, for example, quotes a ‘Compute Unit’ as equivalent to an “early-2006 1.7 GHz Xeon processor”. Thus even a Large Instance with 4 compute units will be significantly inferior to a 2012 quad-core processor. Adding more cloud instances (ie servers) or increasing the size of the instance may seem a solution to this, however this will only be a solution if the application can take advantage of running work in multiple threads. Web sites are usually good candidates for splitting work out across multiple threads (as each request for a page can handled by a different thread). However, more processor intensive work such as statistical analysis require significant development effort to efficiently run on multiple threads.

Cost

The issue isn’t the high cost of cloud hosting, it is that it is so difficult to estimate. For starters it is difficult to purchase a server with specs resembling the ‘instances’ offered by cloud providers. As noted above, cloud providers spec their offerings with CPU ‘units’ corresponding to a five or six year-old processor. How to compare these CPU units with what will be offered on a dedicated server today? At a guess I’d say 8 compute units would be the equivalent a quad-core Intel ‘Ivy Bridge’ Xeon – but it is impossible to know without benchmarking the actual application you plan to run.

A dedicated host server bundled with most of the features needed to get an application live – storage, a bandwidth allocation, IP addresses etc. Cloud providers have an a-la-carte menu of a myriad of options and charges so it is difficult to estimate the final monthly cost. There are several third-party services such as the Cloud Comparison Tool for estimating cloud costs.

An important cost which is often overlooked is support. For a regular provider of dedicated physical servers support is included in the price (although premium support is additional) , however most cloud providers offer no support and charge for even the basic email support.

A major benefit of the cloud is that testing and development servers can be spun up and down on-demand and so minimise the cost of having servers dedicated to these functions which only uses the server resources for a small portion of the month. In practice, cloud servers often get spun up and then forgotten about, silently running up the cloud bill. Cloud providers offer only the minimal tooling to track and prevent this and so purchasing third party monitoring tools is often essential.

Scaling

The cloud is often touted as a packaged solution to scaling out applications, in reality most cloud solutions currently provide little in the way of in-built scaling beyond an easy method to spin up servers on-demand. All the tricky scaling issues such as maintaining user state remain as these are part of the application architecture and have little to do with the infrastructure the cloud provides.

On the contrary the cloud can often force developers to handle scaling issues before they would otherwise need to. Both Amazon and Microsoft require that a minimum of two instances to be running before their SLA (Service Level Agreement) becomes valid. In addition virtual severs on clouds are brittle and will regularly be destroyed and automatically rebuilt. The result of this is that an application needs to be designed for running across multiple servers from the moment it lands on the cloud. It could be a benefit to force developers to consider multi-server scaling early but it is worth noting that a decent specified server can cope with an impressive load – a content focused site such as a news/magazine site should be able to comfortably serve 10 million page views a month from a single physical server.

Support

Your site is down and it looks like a server issue, so who to call? For a lot of cloud providers there is no-one and you are left to the online forums to troubleshoot issues. Amazon (AWS) acquired a notorious reputation for referring all but the largest customers (who payed a princely $15k per month for support) to their online forums for assistance. The only worthwhile support from AWS starts at 10% of the monthly bill. A notable exception to this is RackSpace who provide 24/7 online (IM/Chat) support at no charge.

Support is also trickier with cloud services. A physical server is isolated from other users and identifying issues is comparatively straightforward. If an application is running on the cloud it can be difficult to distinguish between an issue caused by the underlying network or the application itself.

One Size Fits All

Different types of applications have very different hardware requirements. A Content management System which powers a news website would not require a lot CPU resources but would benefit from a large amount of memory (since a massive performance gain can be had from keeping the content in memory as opposed to on disk). By contrast a statistical analysis app would be a heavy user of processing power.

Specifying servers for different user cases is possible when physical servers are purchased/rented as each server can be custom configured with suitable processors and memory. Cloud providers only provide a small set of generic configurations designed, AWS is the only major provider to offer instances targeting both processor and memory intensive applications but it is a limited range with no opportunity for customization.

Cloud providers built their networks using only generic commodity hardware and so do not offer access to cutting edge hardware. A current issue is the access to SSD drives which have been available on physical severs for several years but is still not available for standard cloud setups. This performance benchmark shows the enormous benefit of hosting a database on an SSD.

Entity Framework Gotchas

Entity Framework is fast becoming the default data access method for small to medium sized apps in .NET. Whilst it offers a lot of benefits in terms of productivity and security, working with EF is often very different from working with the standard ADO.NET data-access objects and poses several common issues or ‘gotchas’:

Using Different DBContext objects

This is probably the most common ‘gotcha’ when starting development with EF. When an entity is loaded using a DBContext object it can only be tracked by that DBContext object and attempting to perform work on the entity using a different DBContext object will give rise to a variety of errors.

For example:

        public void LoadUser(AppUser user)
        {
           var db = new FilingContext() ;
           user = db.AppUsers.Find("username");
           UpdateUser(user);
        }
        private void UpdateUser(AppUser user)
        {
           var db = new FilingContext();
            user.Email = "newEmail";
            db.Entry(user).State = EntityState.Modified;
           db.SaveChanges();
        }

This will raise the error:

An entity object cannot be referenced by 
multiple instances of IEntityChangeTracker.

This is a very simplistic example (especially since opening two contexts would be a bad idea for performance reasons) but as the model and data access code grows this issue can often arise.

It is normally best practice in development to limit each method to performing a single task. However, as in the above case, if an entity is created and then passed to another method for additional processing an error will be throw as it is being tracked by more than one DBContext.

The cleanest solution to this is to use AsNoTracking() when loading an entity and then later manually Attach the entity to the DBContext object. This informs EF that the change tracking features are not required and hence the DBContext object does not need exclusive control over the object:

        public void Test(AppUser user)
        {
           var db = new FilingContext() ;
           user = db.AppUsers.AsNoTracking()
                    .Where(x => x.UserName == "userName").Single();
           UpdateUser(user);
        }
        public void UpdateUser(AppUser user)
        {
           var db = new FilingContext();
           db.AppUsers.Attach(user);
           user.Email = "newEmail";
           db.SaveChanges();
        }

As a general point you should consider using AsNoTracking() when loading entities which you know will not be modified, as there is a performance overhead in the DBContext having to track the changes of entities.

Lazy Loading

Lazy loading is when EF determines which related entities need to be loaded and automatically loads these at run-time. For example, if an Employee object contained a Job entity you would not explicity have to request the loading of the Job entity when loading the Employee – just attempt to access the Job as a property of the Employee and EF will automatically load the Job.

Lazy loading looks extremely efficient and convenient but in practice it is totally impractical for most applications. It is a shame that so many of the code samples from Microsoft feature lazy loading, and hence developers typically start off by relying on it. However, to use lazy loading the DBContext which loaded the object needs to still be available and if any form of a Data Access Layer is implemented it has likely been destroyed.

Most experienced EF users recommend wrapping the DBContext object in a using block to ensure it is destroyed after performing its work, in such a circumstance lazy loading could only be used with the using block.

The most common head-scratcher for beginners is that lazy loading is turned off for validation. This leads to run-time errors which an entity has related entities which are marked as [Required] as below:

  public class Employee
        {
        public virtual int ID { get; set; }
        public virtual string Name { get; set; }
        [Required]
        public virtual  Job  Job { get; set; }
        }
  public class Job
        {
        public virtual int ID { get; set; }
        public virtual string JobName { get; set; }
        }
......


var emp = db.Employees.Find(primaryKey); 
emp.Name = "New Name";
db.SaveChanges(); //throws an error at run-time

The above sample looks fine at first sight (and more importantly compiles without errors), but it throws a runtime error since the SaveChanges() method calls into EF Validation which notes the Required entity is not loaded and throws an error.

The simple solution to this is to an an Include statement to explicitly load the required entity:

var emp = db.Employees.Where(x => x.ID == primaryKey)
                      .Include(x => x.Jobs);

In general a good alternative to marking entities as [Required] is to expose a non-nullable foreign key in the model as below. This avoids the common scenario above when EF throws an exception on validation due to a Required entity not being loaded.

  public class Employee
        {
        public virtual int ID { get; set; }
        public virtual string Name { get; set; }
        public virtual  int JobID { get; set; }
        public virtual  Job  Job { get; set; }
        }

SQL Generation From LINQ

LINQ is extremely powerful for querying objects, however, SQL is less so and since EF has to translate LINQ into SQL when a database is queried not all LINQ queries can be successfully executed.

One notable issue is using methods on the DateTime object. The Add() as well as the ToShortDateString()/ToLongDateString() methods of DateTime are commonly used in LINQ however there is no equivalent in SQL. Using these will therefore cause run-time errors. To rectify this you should use methods such as these outside of the LINQ statement.

//Below code causes an error as AddDays can't be translated to SQL for the query.
var employees = db.Employees.Where(x => x.CreateDate > DateTime.Now.AddDays(-7)).ToList();
//Below code executes fine as AddDays is outside the LINQ query.
var oneWeekCutoff = DateTime.Now.AddDays(-7);
var employees = db.Employees.Where(x => x.CreateDate > oneWeekCutoff ).ToList();

Another common issue is using the Contains() method for checking if an object which implements IEnumerable (such as a List) contains another object. For example this method checks if a user is already stored:

  public bool UserExists(AppUser user)
  {
   using (var db = new MyContext())
     {
       return db.AppUsers.Contains(user);
     }
      }

This will compile fine, but will throw the below error at runtime:

Unable to create a constant value of type 'AppUser'. 

Only primitive types or enumeration types are 
supported in this context.

The reason is that LINQ-to-entities cannot translate Contains into a SQL Statement when an object is passed in.

A common solution to this is to use the Any() method and use a lamba to check for if any objects in the context have the primary key of the object you are testing for:

  public bool UserExists(AppUser user)
{
using (var db2 = new FilingContext())
 {
 return var bool1 = db2.AppUsers.Any(x => x.ID = user.ID);
 }
 }

Automatic Database Creation

EF can automatically update the database or create it if one does not exist – at least that is the headline. It would be more accurate to say that EF can automatically create or update the database when the first attempt is made to access the database.

This distinction can lead to a lot of confusion when a developer creates or updates the EF model, then loads the home page and checks the database which is mysteriously unchanged. The solution is to load a page which attempts to access some of the EF entities, this will then trigger the database creation/update.

Even adding an initializer in Application_Start() as below will not change this behaviour since this only specifies the strategy for when the first access attempt is made on the database.

protected void Application_Start()
{
Database.SetInitializer(new CreateDatabaseIfNotExists());
//....
}

To trigger creation of the database upon startup, create an instance of your Context and then run its Initialize method:

protected void Application_Start()
{
    Database.SetInitializer(new
            CreateDatabaseIfNotExists());

    var db = new YourContext();
     db.Database.Initialize(true);
//....
}

Using EF with ASP.NET Membership

This is more of a headache than a ‘gotcha’ since EF makes no pretense of supporting or implementing the commonly used ASP.NET Membership system. It does however help to know that there is no easy solution to this problem – one very bad idea is to have the ASP.NET Membership tables in the same database as the EF entities since the EF database is likely to be dropped and recreated.

If you are looking to have a Foreign Key in an EF entities pointing at note that the primary key for users in ASP.NET Membership is the UserID not UserName and can be programatically accessed as below:

  MembershipUser currentUser = Membership.GetUser();
 string currentUserID = currentUser.ProviderUserKey.ToString();

Also note that having a ASP.NET Membership foreign key in your EF model leaves your app vulnerable to any changes Microsoft might make to the ASP.NET Membership schema in future as noted in this critique which alas doesn’t provide any clean solutions (since there really are none!).

There are a lot unique issues in EF, but these are the ones that I find are most frequently encountered when getting started using EF. Let me know in the comments if there are any other common gotchas you have come across.

Using Entity Framework With An Existing SQL Server Database (Code Second)

Entity Framework originally shipped with two modes – Model First which allowed for the entity model to be created using the designer UI and Database First which created the EF model from a database.

EF 4.1 introduced Code-First development, enabling developers to simply generate the EF model and database directly from code. This method rapidly became the most popular method for using EF, since it is relatively unobtrusive and developers can simply write POCOs (Plain Old CLR Classes) and have these seemlessly persisted to SQL Server.
Code-First did not however reverse engineer POCOs from a database schema.

The workaround was to use Database First to generate the edmx files and then generate the POCOs from the edmx files.
With Entity Framework Power Tools it is possible to reverse engineer a POCO model from an existing database (sometimes called ‘Code Second’). Simply install EF Power Tools (download here).

Prior to generating the POCOs you may wish to customize the code generation. This step is best skipped if this is the first time you are working with EF for reverse engineering code.
To customize the code creation, open the project you wish to use with EF, right click on the project name and select Entity Framework > Customize Reverse Engineer Templates:

This option does not perform any mode generation, but adds the T4 templates which are used for the model generation to you project. The templates are added in the Code Templates > ReverseEngineerCodeFirst folder. There are three T4 template, one for customizing the mappings, one for the context and one for the entities themselves. The mostly likely candidate for customization is the Entity.tt file for generating the entities, the Context.tt will just generate the context and this will be easy to edit it after creation, if the mappings need editing it will probably be on a case-by-case basis and so post generation editing would be fine.

In the Entity.tt file you can, for example, edit the constructors for the entities – by default the generator adds code to initialize all ICollection objects to new Lists to avoid run-time errors when attempting to access a list with no added entities, but you may wish to modify this or add custom code in the constructor:

One thing I like to do is remove virtual from all properties in an entity. This will prevent lazy loading (see Entity Framework Gotchas for some of the issues with lazy loading) and set change tracking to Snapshot Change Tracking which is more straightforward to work with.

Next, to generate the POCOs. Right-click on the project and select Entity Framework > Reverse Engineer Code First:

Then simply select the server and database from which to generate the classes using the familiar connection / database dialog:

Click OK to start the model generator and the model classes should now appear in your project in the Models folder:

The POCOs will each be created in a separate file, in addition each entity will have an associated mapping information file in the Mappings folder describing how the properties in the entity will relate to eachother and to other entities in the database. When EF creates the database the mappings files will be used to create the schema, if no mappings files exist then EF can still create the database but only using its default object mappings. Note that the mapping files are not required to use EF code first (and you could avoid them being generated by deleting the Mapping.tt file prior to running the model generation but they are very useful to have as they will provide a great starting point for customizing the model later).

The context class itself may require some attention. The context class inherits from the EF DbContext class and all access to the database will be through a method exposed by this class or one of its DBSet properties (a DBSet corresponds to a table in the database).

By default the context class will be given the name of the database with ‘Context’ appended – ie [yourDBName]Context.cs . The first thing to note in the context class is the Initializer. By default the initializer is set to null meaning that EF will not automatically recreate or modify your database in the event of a change in the model:

        static yourDBNameContext()
        {
            Database.SetInitializer<yourDBNameContext>(null);
        }

This is a very normal setup since EF Migrations does an excellent job of handling the updates to the database. However, for rapid development at the start of a project it is very efficient to have EF automatically drop and recreate the database without generating un-necessary migration files, if so you can choose to set this to DropCreateDatabaseIfModelChanges as below. There are several other less useful initialization options such as DropCreateDatabaseAlways which drops and recreates the database everytime the app is loaded.

Whether you are using EF Migrations or DropCreateDatabaseIfModelChanges you may want to consider a Seed method. The seed method runs when the database is created or updated to add some data which may be required for development and testing.

        protected override void Seed(yourDBNameContext db)
        {
            //Hard code data to be saved to database on db creation
            db.Employees.AddOrUpdate(
              x => x.FullName,
              new Employee { FullName = "Jude OKelly" },
              new Employee { FullName = "Diego Maradona" }
            );
        }

Note above the use of the AddOrUpdate method which checks if the record exists previous to adding it.

In the Context class also note the call to the base constructor which passes in the connection string:

public yourDBNameContext()
: base("Name=yourDBNameContext")
{
}

There will be an entry in the <connectionStrings> node web.config for the actual string.

That’s all that is required. Overall it is a very simple process, and even with a bit of customization and testing it should take less than half hour to have a fully functioning EF model.