
Exploring Data Cubes On NSimpleOlap (Alpha)

The goal of this article is to give a quick rundown of the current features of the NSimpleOlap library. The library is still in development, with new features being added and its API still undergoing refinement.

NSimpleOlap is an OLAP engine, more precisely an in-memory OLAP engine, intended for educational use and for application development. It has a simple-to-use API that allows for easy setup and querying, which I hope will make it easier to demonstrate the usefulness of OLAP engines in solving certain classes of problems.

This library is provided as is and with no front-end, and it is not directed towards finance, fancy dashboards or the regular Business Intelligence use cases, where over-hyping and exaggerated licensing fees have, in my opinion, limited the scope of use of these systems and undermined their acceptance, confining them to the BI silo that is often only for the eyes of business managers.

But it can be much more…

Starting With The Demo Application

I will start with the demo application, since it makes for an easier presentation and exploration of the main concepts required to query a data cube. It is a simple console application, a throwback in time for those who are more graphically minded, but it will do for the purposes of this article.

You can find the demo application and the NSimpleOlap library by following this link:

The seed data in this demo application is a very basic flat CSV file, where the base dimensions are referenced as numeric ids that each have a corresponding entry in a supporting CSV file.

category, gender, place, Date, expenses, items
1, 1, 2, 2021-01-15,1000.12, 30
2, 2, 2, 2021-03-05,200.50, 5
4, 2, 5, 2021-10-17,11500.00, 101
3, 2, 2, 2021-08-25,100.00, 20
2, 1, 6, 2021-02-27,10.10, 5
1, 2, 2, 2021-08-30,700.10, 36
5, 2, 5, 2021-12-15,100.40, 31
1, 1, 3, 2021-09-07,100.12, 12
3, 2, 3, 2021-06-01,10.12, 30
2, 2, 2, 2021-06-05,10000.12, 30
1, 2, 1, 2021-05-04,100.12, 1
4, 2, 2, 2021-01-03,10.12, 6
2, 2, 3, 2021-11-09,100.12, 44
1, 2, 3, 2021-07-01,10.12, 8
4, 1, 1, 2021-04-24,100.12, 5
1, 1, 6, 2021-06-02,10.12, 7
4, 3, 6, 2021-05-18,100.12, 30
2, 1, 2, 2021-08-21,60.99, 8
1, 2, 2, 2021-02-16,6000.00, 89
4, 3, 6, 2021-03-07,600.00, 75
1, 1, 6, 2021-01-01,10.00, 12
4, 2, 2, 2021-07-28,2000.00, 30
5, 2, 6, 2021-12-20,50.10, 11
3, 1, 3, 2021-06-08,130.50, 2

Executing the demo application will show the following initial console messages.

You can type help to get a basic example of how you can make simple queries, and get the available dimensions and measures.

You can type a simple query, and get the outcome once you hit enter.

As you can see, the results aren't chronologically ordered, but appear in the order in which the cells were picked up by the query engine. This will be resolved once order selection is implemented.

Here’s another example.

And another example, now focusing on the records for which there is no gender data.

As you can see, some of the outputs have many empty cells, because the test data isn't very big, so in terms of the space of all possible aggregations the current data cube is very sparse. But you can still view the data from different perspectives and get an idea of what is possible.

Starting Your Own Cube

At this stage of development you can define dimensions, measures and metrics. Regular dimensions define lists of attributes or entities (colour, gender, city, country, etc.), while Date dimensions need to be handled differently, since they follow defined calendar patterns and need to be generated from the incoming data in the facts table.

Measures are variables that were observed from the entities defined in the facts table; these can be quantities of goods sold or bought, the value or price of goods, the total value of an invoice, temperature, rainfall, etc. These will be aggregated inside the cube in various combinations, although this entails a certain loss of context, since an aggregated cell that resulted from multiple data points won't tell you much about the pattern of the input data. But a cube is about exploring the forest, not the individual trees.

Metrics are expressions that are calculated at aggregation time, and these allow you to make some extra calculations as well as to keep some extra data context in the cell. These calculated values can be averages, minimum and maximum values, or any expression made from a composition of the implemented operations.

Setting Up Regular Dimensions

When adding new dimensions you will need to initially set up your facts data source. In this particular example we will need to specify a CSV file and add the fields from the file that we want as sources for the Cube. You will also need to specify the data source that holds the dimension members, with the column that will be used as an id and the column that will be used as the dimension member name.

CubeBuilder builder = new CubeBuilder();

builder.AddDataSource(dsbuild =>
        {
          dsbuild.SetName("sales")
            .SetSourceType(DataSourceType.CSV)
            .SetCSVConfig(csvbuild =>
            {
              csvbuild.SetFilePath("TestData//facts.csv")
                              .SetHasHeader();
            })
            .AddField("category", 0, typeof(int));
        })
        .AddDataSource(dsbuild =>
        {
          dsbuild.SetName("categories")
            .SetSourceType(DataSourceType.CSV)
            .AddField("id", 0, typeof(int))
            .AddField("description", 1, typeof(string))
            .SetCSVConfig(csvbuild =>
            {
              csvbuild.SetFilePath("TestData//dimension1.csv")
                              .SetHasHeader();
            });
        });

Then you will need to map the columns in your facts data source to your cube dimensions.

builder.SetSourceMappings((sourcebuild) =>
        {
          sourcebuild.SetSource("sales")
            .AddMapping("category", "category");
        })

And then add the metadata mappings from the dimension member data sources.

builder.MetaData(mbuild =>
        {
          mbuild.AddDimension("category", (dimbuild) =>
          {
            dimbuild.Source("categories")
              .ValueField("id")
              .DescField("description");
          });
        });

Setting Up Measures

Getting a measure into a cube requires only two steps. First, map the measure column from the facts data source.

builder.AddDataSource(dsbuild =>
        {
          dsbuild.SetName("sales")
            .SetSourceType(DataSourceType.CSV)
            .SetCSVConfig(csvbuild =>
            {
              csvbuild.SetFilePath("TestData//tableWithDate.csv")
                              .SetHasHeader();
            })
            .AddField("category", 0, typeof(int))
            .AddField("expenses", 4, typeof(double));
        })

And then add the measure metadata mapping for the cube.

builder.MetaData(mbuild =>
        {
          mbuild.AddDimension("category", (dimbuild) =>
          {
            dimbuild.Source("categories")
              .ValueField("id")
              .DescField("description");
          })
          .AddMeasure("spent", mesbuild =>
          {
            mesbuild.ValueField("expenses")
              .SetType(typeof(double));
          });
        });

Setting Up Date Dimensions

Adding a Date dimension will add an extra layer of complexity, since you will need to specify what kind of Date levels you want the data to be sliced into.

You will start by mapping the Date field and, in this case, specifying the date/time format that was used in the CSV file.

builder.AddDataSource(dsbuild =>
        {
          dsbuild.SetName("sales")
            .SetSourceType(DataSourceType.CSV)
            .SetCSVConfig(csvbuild =>
            {
              csvbuild.SetFilePath("TestData//tableWithDate.csv")
                              .SetHasHeader();
            })
            .AddField("category", 0, typeof(int))
            .AddDateField("date", 3, "yyyy-MM-dd")
            .AddField("expenses", 4, typeof(double));
        });

Add the mapping to the data source and indicate what label fields you want.

builder.SetSourceMappings((sourcebuild) =>
        {
          sourcebuild.SetSource("sales")
            .AddMapping("category", "category")
            .AddMapping("date", "Year", "Month", "Day");
        })

When defining the dimension metadata, specify the dimension labels and the type of information the data will be transformed into. In this case you will have three dimensions: Year, Month and Day.

builder.MetaData(mbuild =>
        {
          mbuild.AddDimension("category", (dimbuild) =>
          {
            dimbuild.Source("categories")
              .ValueField("id")
              .DescField("description");
          })
          .AddDimension("date", dimbuild => {
            dimbuild
            .SetToDateSource(DateTimeLevels.YEAR, DateTimeLevels.MONTH, DateTimeLevels.DAY)
            .SetLevelDimensions("Year", "Month", "Day");
          })
          .AddMeasure("spent", mesbuild =>
          {
            mesbuild.ValueField("expenses")
              .SetType(typeof(double));
          });
        });

Setting Up Metrics

At the moment metrics can only be set after the Cube is initialized, and not at configuration time, since that would require parsing text expressions. But you can still add metrics using the expression building API.

Setting up a metric will require you to identify what measures you want to use, and what maths operations are necessary to build it. As a simple example…

var cube = builder.Create<int>();
cube.Initialize();

cube.BuildMetrics()
    .Add("Add10ToQuantity", exb => exb.Expression(e => e.Set("quantity").Sum(10)))
    .Create();

This won’t do much to further the understanding of the data but it’s a start.

For more useful expressions you can also combine two measures and get rates and ratios.

cube.BuildMetrics()
    .Add("RatioSpentToQuantity", exb => 
     exb.Expression(e => e.Set("spent").Divide(ex => ex.Set("quantity").Value())))
    .Create();

Or use some useful functions and retain some context from the source data.

cube.BuildMetrics()
        .Add("AverageOnQuantity",
          exb => exb.Expression(e => e.Set("quantity").Average()))
        .Add("MaxOnQuantity",
          exb => exb.Expression(e => e.Set("quantity").Max()))
        .Add("MinOnQuantity",
          exb => exb.Expression(e => e.Set("quantity").Min()))
        .Create();

Getting More With Queries

A data cube is nothing if it cannot be queried. The NSimpleOlap fluent query API borrows many concepts from the MDX query language, so you will need to get familiar with specifying your rows and columns as tuples. In general this is no different from setting paths or using something like XPath in XSLT or any XML DOM API. You are not only slicing the cube, you are also defining what data hierarchies you want to visualize.

Defining a simple query and sending the output to the text console.

cube.Process();

var queryBuilder = cube.BuildQuery()
    .OnRows("category.All.place.Paris")
    .OnColumns("sex.All")
    .AddMeasuresOrMetrics("quantity");

var query = queryBuilder.Create();

query.StreamRows().RenderInConsole();
|                                | sex male  | sex female
    category toys,place Paris    |     12     |      8
 category furniture,place Paris  |     2      |      30
  category clothes,place Paris   |            |      44

You can also select both measures and metrics at the same time in a query.

var queryBuilder = cube.BuildQuery()
    .OnColumns("sex.All")
    .AddMeasuresOrMetrics("quantity", "MaxOnQuantity", "MinOnQuantity");

var query = queryBuilder.Create();
var result = query.StreamRows().ToList();

Making filters on the aggregate values and the facts is also possible. First we will filter on the aggregates.

var queryBuilder = cube.BuildQuery()
    .OnRows("category.All.place.All")
    .OnColumns("sex.All")
    .AddMeasuresOrMetrics("quantity")
    .Where(b => b.Define(x => x.Dimension("sex").NotEquals("male")));

var query = queryBuilder.Create();
var result = query.StreamRows().ToList();

Then we will reduce the scope of the data by filtering on a measure.

var queryBuilder = cube.BuildQuery()
    .OnRows("category.All.place.All")
    .OnColumns("sex.All")
    .AddMeasuresOrMetrics("quantity")
    .Where(b => b.Define(x => x.Measure("quantity").IsEquals(5)));

var query = queryBuilder.Create();
var result = query.StreamRows().ToList();

Filtering on the facts will generate a cube with a smaller subset of data. This makes sense, since the main Cube doesn't have the full context of the facts, and any operation that requires digging into the source facts will require generating a new Cube to represent those aggregations.

In Conclusion…

The NSimpleOlap core is getting more stable, and it's already possible to query on complex hierarchies of dimensions. But there is still much to do: Time dimensions, adding dimension levels through metadata, transformers to convert measure data into interval dimensions (to be able to query age ranges, for example), etc. Also, some more work is required on the structures that enable better rendering of row and column headers in a hierarchy. Much to do, and so little time…


Presenting NSimpleOlap (Alpha & Unstable)

NSimpleOlap is a project that I started in 2012 with the goal of building a stand-alone, embeddable .NET OLAP library that can be used within the context of console, desktop, or other types of applications. My initial motivation for starting this project was that at the time there weren't that many lightweight Open Source implementations; the implementations that suited my preferences were too expensive, only existed as server solutions, etc.

In my previous professional path building tools for Marketing Research I was exposed to many of the tropes of what is called Analytics, and that gave me some understanding of the basics of Business Intelligence. Even after leaving the Marketing Research business I kept an interest in the subject and the tools of the trade, and I researched the market for tools that had similar business cases, like survey and questionnaire applications, and OLAP and BI server solutions. Some products struck a chord with me, like Jedox, Pentaho and JasperReports, and I even dabbled in Microsoft SQL Server Analysis Services. But these were not the products I was looking for.

My interests had shifted: I wanted an OLAP engine that could be used within the context of an application and that could do aggregations quickly on a limited dataset, or in real time with some caveats. And although it's true that at the time there were some analytics solutions, like Tableau, that provided a full range of data, reporting and UI tools, and some real-time features, they were not what I was after. So in 2012 I decided to start this project.

In the beginning of 2012 the project was actually evolving very quickly; unfortunately, a personal mishap derailed everything, and for professional and personal reasons I wasn't able or motivated to restart development. But out of frustration and disillusionment with the way technical skills are evaluated I decided to take a chance and get the project into a releasable state. It's my intention that this project will help to educate more developers on the utility of aggregation engines beyond the fields of Business Intelligence and Finance.

At a personal level I am quite frustrated with the way interviews for developer roles are done, how technical skills are evaluated, and the whole selection process. From the box ticking, to questions about algorithms and data structures that are rarely or never used, to the online gamified technical tests, the code challenges that require several days of full-time work (and that are suspiciously like real-world problems), the bait and switch, etc. And that is just the recruiting process; the actual work itself very often provides very little in terms of career growth. In some cases, people that you work with have an incentive to devalue your skills, steal your ideas or just take advantage of you. Also, it's annoying as hell to have to watch the constant manhood measurement contests, the attention-seeking narcissists, the friendly backstabbers, and the occasional incompetent buffoon.

Well, that is off my chest… Rant over.

The Project

At the present moment the NSimpleOlap project is still in alpha stage and unstable, and it only allows for some basic querying and limited modes of aggregation. Some of its features are still experimental, and are implemented in a way that allows for easy testing of different opportunities for optimization and/or feature enhancement. You can find it by going to the following GitHub repository:

At the conceptual level NSimpleOlap borrows a lot from the MDX query language, the model of the Star Schema, and modelling and mapping conventions that are common in the modelling of data Cubes. As an example, tuples and tuple sets are the way you can locate Cube cells, and they can be used to define what information comes in rows or in columns. Examples of tuples are as follows (a query sketch using a tuple follows the list):

  • Category.Shoes.Country.Italy
  • Year.2012.Products.All
  • Gender.Female.State.California.Work.IT
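
To make this concrete, here is a minimal sketch of how a tuple could be used with the fluent query API that is introduced later in this post. The dimension names, the "sales" measure and the already built cube variable are assumptions for illustration only, not part of the demo data.

// Hypothetical query: row cells located by the tuple Category.Shoes.Country.Italy,
// column cells by the tuple Year.2012.Products.All, aggregating an assumed "sales" measure.
var queryBuilder = cube.BuildQuery()
  .OnRows("Category.Shoes.Country.Italy")
  .OnColumns("Year.2012.Products.All")
  .AddMeasuresOrMetrics("sales");

var query = queryBuilder.Create();
var cells = query.StreamCells().ToList();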

There are some concepts that you will need to be familiar with in order to use this library:

  1. Dimension – This is an entity or characteristic of your data points; it can be a socio-demographic variable like gender, age, region, etc., or a product name, year, month, etc.
  2. Dimension Member – This is a member of a dimension; in the case of gender an example would be “female”.
  3. Measure – This is a variable value from your data points, it can be the sale value, number of items bought, number of children, etc..
  4. Metrics – This is a value that is calculated from the aggregated results, can be an average, a percentage, or some other type of complex expression.

To be able to populate the Cube you will need to organize your data in a table that has all the facts, where the dimension columns contain numerical keys, and to have those keys and the relevant metadata in separate dimension definition data sources.
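
As an illustration, the facts file and one of the dimension files might look like the following. These contents are hypothetical and simply mirror the layout of the CSV sources used in the builder example below; the third facts column is assumed to be a place dimension.

table.csv (dimension columns hold numeric keys, followed by the measure columns):

category, sex, place, expenses, items
1, 2, 2, 1000.12, 30
2, 1, 3, 200.50, 5

dimension1.csv (maps each category key to its member name):

id, description
1, shoes
2, clothes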

Building Your First Cube

Building a Cube will require some initial setup to identify the data sources and mappings, and to define the useful metadata. In the following example we will build a Cube from data that is contained in CSV files, and these will be used to define the Cube dimensions and measures.

CubeBuilder builder = new CubeBuilder();

builder.SetName("Hello World")
.SetSourceMappings((sourcebuild) =>
{
  sourcebuild.SetSource("sales")
  .AddMapping("category", "category")
  .AddMapping("sex", "sex"));
})
.AddDataSource(dsbuild =>
{
  dsbuild.SetName("sales")
  .SetSourceType(DataSourceType.CSV)
  .SetCSVConfig(csvbuild =>
  {
    csvbuild.SetFilePath("TestData//table.csv")
    .SetHasHeader();
  })
  .AddField("category", 0, typeof(int))
  .AddField("sex", 1, typeof(int))
  .AddField("expenses", 3, typeof(double))
  .AddField("items", 4, typeof(int));
})
.AddDataSource(dsbuild =>
{
  dsbuild.SetName("categories")
  .SetSourceType(DataSourceType.CSV)
  .AddField("id", 0, typeof(int))
  .AddField("description", 1, typeof(string))
  .SetCSVConfig(csvbuild =>
  {
    csvbuild.SetFilePath("TestData//dimension1.csv")
    .SetHasHeader();
  });
})
.AddDataSource(dsbuild =>
{
  dsbuild.SetName("sexes")
  .SetSourceType(DataSourceType.CSV)
  .AddField("id", 0, typeof(int))
  .AddField("description", 1, typeof(string))
  .SetCSVConfig(csvbuild =>
  {
    csvbuild.SetFilePath("TestData//dimension2.csv")
             .SetHasHeader();
  });
})
.MetaData(mbuild =>
{
  mbuild.AddDimension("category", (dimbuild) =>
  {
  dimbuild.Source("categories")
    .ValueField("id")
    .DescField("description");
  })
  .AddDimension("sex", (dimbuild) =>
  {
  dimbuild.Source("sexes")
    .ValueField("id")
    .DescField("description");
  })
  .AddMeasure("spent", mesbuild =>
  {
  mesbuild.ValueField("expenses")
    .SetType(typeof(double));
  })
  .AddMeasure("quantity", mesbuild =>
  {
  mesbuild.ValueField("items")
    .SetType(typeof(int));
  });
});

Creating the Cube will require you to make the necessary method calls so that the data is loaded and processed. This can be done as follows.

var cube = builder.Create<int>();

cube.Initialize();
cube.Process();

Querying The Cube

Querying the Cube can be done by using the querying interface; here's a basic example:

var queryBuilder = cube.BuildQuery()
  .OnRows("sex.female")
  .OnColumns("category.shoes")
  .AddMeasuresOrMetrics("quantity");

var query = queryBuilder.Create();
var result = query.StreamCells().ToList();

In the previous example you streamed the results by cells, but you can also stream by rows:

var result_rows = query.StreamRows().ToList();
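
Since StreamCells and StreamRows return enumerable sequences (which is why ToList can be called on them), you can also iterate over the results lazily instead of materializing a list. This is only a consumption sketch; the exact shape of the row objects isn't covered in this post.

// Hypothetical consumption loop over the streamed rows.
foreach (var row in query.StreamRows())
{
  Console.WriteLine(row);
}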

You can also add some basic expressions to filter on the table facts; this will limit the scope of the rows that will be aggregated.

var queryBuilder = cube.BuildQuery()
  .OnRows("sex.All")
  .OnColumns("category.All")
  .AddMeasuresOrMetrics("quantity")
  .Where(b => b.Define(x => x.Measure("quantity").IsEquals(5)));

var query = queryBuilder.Create();
var result = query.StreamCells().ToList();

Or you can add some basic expressions to filter on dimension members, which won’t affect the scope of the aggregated results.

var queryBuilder = cube.BuildQuery()
  .OnRows("sex.All")
  .OnColumns("category.All")
  .AddMeasuresOrMetrics("quantity")
  .Where(b => b.Define(x => x.Dimension("sex").NotEquals("male")));

var query = queryBuilder.Create();
var result = query.StreamCells().ToList();

Concluding & Hopes For the Future

In conclusion, there is still a lot of work to be done to get to feature sets like dimension levels, Date and Time dimension transformers, query expressions, etc. Hopefully these features will be coming in the near future.


From Traditional Marketing Research to Surveillance Marketing

Since the days of the first branded mass-produced goods, companies have used ads in the form of posters, billboards and newspaper ads. That evolved into radio jingles, TV advertisement shorts and the current form of targeted ads that follow you through each site and platform.

These ubiquitous siren calls usually feel much like a low-level annoyance we have to endure, but they are there because of a symbiotic relationship between media, advertisers and businesses. Mass media companies cannot rely solely on paid subscriptions or newspaper sales to survive, and for TV broadcasters there isn't even that sort of option (except where there is a TV licensing tax, or for premium cable channels), so they depend on ad placement in their media channels to survive and make a profit.

On the other hand, businesses need to increase and maintain the visibility of their brands, products and services, to make potential customers aware of their offerings, and to increase the probability that their products will be picked up instead of those of their competitors. To enable that, there is an ecosystem of companies that produce ads, manage marketing campaigns, manage ad placements and evaluate campaign effectiveness.

“Half the money I spend on advertising is wasted; the trouble is I don’t know which half.” – John Wanamaker.

Marketing effectiveness has been one of the main concerns of marketeers, and one of the main drivers for gathering as much information as possible about consumers and audiences, so that it is possible to understand the effect of a campaign or find ways to increase the effectiveness of the ad message and get closer to a sale.

Tools of Marketing Research

Statistics, surveys and questionnaires are an important part of the marketing research tool set, and these are important to:

  • Know and understand the features of the market, like total size, geographic distribution, age distribution, purchasing power, etc..
  • Select a target group to whom to direct marketing campaigns.
  • Understand the selected group’s preferences and behaviours and how they can be used to market products.
  • Understand if the marketing campaign is effective and if the core message of the campaign is associated with the brand / product, and if not, why.

To be able to do their job, marketing analytics operators rely on things like statistical censuses. These are considered to be valuable tools for governments, but also for marketing, since it is based on census data that it's possible to know the base socio-demographic data of a country and region: things like the size of the population in urban centres, metropolitan areas and suburbs, age distribution, income distribution, and so on. This will give an idea of the potential size and value of the market that a product or service can capture. In most cases this data will be very generalized and focused on dealing with public policy priorities, but it can help to identify base targets and their size, like male children from age 8 to 10 that live in households with incomes over 48,000 euro. This could be a valid target for an ad campaign to sell, for example, a brand of toys or a fizzy drink, but it's not enough to have a target. You will need to know more.

[Diagram: sources of marketing research data]

The national census is a good source of data but it's not enough; you'll need to fill the gaps in your market knowledge by using surveys. These can be phone or web surveys, or direct interviews in the street or in people's homes, and they can be part of an ongoing permanent survey and/or a limited ad hoc survey. The methodology of the surveys and the way the data is treated will also be very important, and will have an impact on what kind of extrapolations you are able to make from the data with any reliability.

In any serious marketing survey there is always the consideration that the sample of the survey needs to match the population universe, in whole or in part. That means that in terms of percentages of gender, social class, age groups and other socio-demographic features, the ideal sample is similar to the population universe. In some cases, for targeted campaigns, you might only want to survey people that potentially fall under your target demographic. And in all cases, the number of randomly selected people that take part in the survey needs to be large enough and non-biased so that it's statistically representative.

Surveys can be about almost anything, and the previous diagram shows some examples of the scopes that are required, since these can help with the specific needs of marketing research.

As an example, if you need to know how your product or ad campaign can match your target, then you need to have a survey on consumer preferences and/or behaviours, and find the aspects that can enhance your campaign message, packaging or product features to meet those preferences.

Pre-Internet Platform Workflow

Before internet platforms were a thing in the early 2000s, much of the media business was still working under the traditional model for ad campaigns and established marketing research practices. And although internet media already existed, most content was text and images, and sound streaming was still in its infancy, while video streaming was hampered at the time by the lack of broadband infrastructure and the lack of a viable monetization model.

In this period, ad campaigns were mostly done through mass media: billboards, newspapers and magazines, radio and TV. These were effective channels of distribution in the sense that they would broadcast the campaign message to the media's captive audience. From this point of view, TV was the broadcast medium that had the most consumer reach, and depending on the number of available TV channels and private operators, this medium would capture the largest share of the available budgets for marketing campaigns.

The biggest problem with broadcast media, though, was that it was very difficult to measure engagement, or whether the campaign was reaching the intended audience. This gave an opportunity to print media and radio, which have more specific target audiences, and this was one of their greatest selling points to advertisers.

Measuring each mass medium's target audience was an essential part of the ad market. This was done by third-party companies like Nielsen, TNS Sofres, and many others that specialized in marketing research. These provided the baseline surveys that helped identify the profile of the audience for each media entity, and the evolution of that profile over time. They were also often responsible for the ratings surveys that provided instant audience measurements over time. These ratings would then be decomposed by socio-demographic features, and sold to advertisers, media and other marketing research companies.

TV broadcasters could then use their average ratings over the available timeslots to set their pricing, with large audience shows having their timeslots sold at a premium to advertisers. In the USA it was common for TV broadcasters to auction timeslots in advance, but in most markets available timeslots were sold only months or weeks in advance of the specific time the promo spot would be aired.

With the ratings, the media socio-demographic profiles and the pricing for ad placement, advertisers could optimize their marketing budget per campaign, allocating budget to each media type to maximize the possible target reach and get the most bang for the buck. This relied on data, from the likes of Nielsen and others, that was produced weekly, monthly or quarterly for each type of media outlet.

[Diagram: traditional media campaign workflow]

Once the ad campaign starts, advertisers need to understand how the campaign is progressing and what impact it's having. They can have a first overview based on the ratings at the time the ads were aired, but in the case of print media, billboards or radio this might not be possible. For this reason there would be follow-up surveys, mostly performed through phone interviews, that would ask people what brands they remembered, what ad messages they recalled, to what brand they associated that message, and in what media they were exposed to the message.

From the results of these surveys, advertisers could evaluate whether the ad campaign was failing to get across the desired message, whether the message was being associated not with the brand but with the brand of a competitor, or whether a particular medium wasn't working out as a channel for the ad campaign. From this, as an example, advertisers could decide to make changes to the campaign: increase the number of ad placements in newspapers, reduce the number of TV ads, change the ad that is aired on TV to a shorter version, and let go of all radio ads for the duration of the campaign.

[Diagram: campaign evaluation workflow]

The aspect that is most evident about this whole process is the indirect nature of all these measurements. You aren't measuring ads against conversion into sales; you are measuring ads against message reach and its connection to the brand awareness that can be extrapolated from the responses of potential consumers. In this model of workflow, you have little direct interaction between the ad and the consumer, and much of the process for ad placement is very much like shotgunning at a target.

Marketing With Platforms

At the time of the dot-com bubble there was little in the way of knowing what the audiences of the majority of sites were. We mostly had the numbers that the site operators supplied: things like the number of pages seen and the number of hits per site or page. But these could be manipulated, and didn't carry any extra information that could help advertisers direct their ads to a particular target.

The first attempts to rate and characterize the web audience relied on two approaches: a panel survey that would monitor the users' web browsing and use that data as the base to extrapolate metrics for a country, or a component or script on the site's server that would monitor incoming traffic and navigation through the site. Each had its strengths and weaknesses, but in general the data available was either limited to a subset of all web users or to a subset of all sites.

I was a developer on a web panel survey project in the early 2000s, and this particular product was used by large telecoms, media companies, and big advertisers. But one of the issues that was very common then, and is still common today, was the fragmentation of the audience. Sure, some portals and the most commonly referenced sites were usually ranked at the top in number of views, but there was a large tail of sites. The problem with these sites at the tail, which might actually be big names or brands, was that in general they wouldn't have enough cases to be statistically relevant, due mostly to the limits on the size of the panel, its distribution and composition.

Unbeknownst to us at the time, Google was becoming a big player in the arena, since they had started their AdSense platform in 2003. Besides placing ads on the sites of participants, Google would also allow advertisers to purchase keywords related to their offerings, so that when users did a search query that matched the advertiser's keyword, an ad would be inserted into the search results page. This was how Google was able to monetize its service, but it actually allowed one thing more.

Even before Google AdSense, clicking an ad on a site could give potentially useful information, like the source referral and other parameters. Also, cookies could be used in conjunction with the incoming parameters to track a user, which made it possible to have a limited glance at a user's browsing patterns. And since converting the visit into a purchase would in general mean giving personal information, that could allow sites to get a better picture of the user if they had the proper tools. But although the early 2000s were very much a wild west, the lack of tooling and the limitations of the medium for providing content meant that user information was fragmented across several sites.

In this sense, Google and other platform providers started offering services through tools and software components that could be blended into the site's code, by intercepting incoming traffic on the server, or by supporting add-ons for content management systems. These provided easy-to-use ad placement platforms, user tracking and monitoring, and aggregation of information for marketing purposes.

[Diagram: search and ad platform model]

The build-up of third-party user tracking services and the evolution of the freemium offerings of digital platforms like Google increasingly made panel surveys redundant for doing marketing research for digital media. This trend has accelerated with social media, and with the increasing aggregation of user data at the back-end, allowing advertisers to micro-target users based on very specific categories that are outside of the traditionally used socio-demographic categories.

Currently a user might be tracked on their browsing behaviour on practically every site. Each site has more than one third-party service that monitors users arriving at the site and tracks their navigation, and sometimes even their mouse pointer motions. These third-party services could be directly linked to an ad platform, or could be a data aggregator that then integrates with other platforms, marketing companies or advertisers.

This kind of data aggregation can be federated into a platform identity, a proxy for the user identity within the platform. For this to be possible, the trail of tracking cookies, IP addresses, the user's logins into the platform services, and even their public social media posts, can be linked together. And that's just the information they can acquire through your browsing behaviour; there are also the shady data brokers that sell personal data and phone & mailing lists (one of the main reasons for the GDPR regulation).

Your platform profile data is then used to contextualize services, media feeds and search results. It allows advertisers to zero in on your preferences and behaviour. That means that you will see ads following you from site to site, even when you are browsing in different language regions. It will also make surprising choices in the ads it chooses to display, even on mainstream sites. This can mean showing ads for dating services, trading services or gambling sites, which might be the result of your previously tracked browsing history.

There are many ways that an ad service might profile you based on your recently tracked web behaviour, and what I will state here is in part educated speculation:

  • You searched for a particular set of keywords that relate to a product or service, this can cause several follow-up ads to appear during the following hours or days.
  • You have gone through pictures or social media profiles of people of the opposite sex, with images of what could have been categorized as single and attractive specimens. This doesn’t need to be revealing photos, or anything out of the ordinary, but the frequency of the behaviour can be flagged by something like a deterministic or a statistical classification algorithm. So, consider yourself warned.
  • You click more frequently in certain links than others, these might be magazines, newspapers or to comment on posts in your social media feed. If the content of those links or posts was categorized, that might give a hint about your preferences. Also, your comments on the posts can be analysed for negative or positive engagement with the categories associated to the post.
  • Your mobile phone has location tracking turned on, and you have several friends with that feature also turned on. The platform picks up the associations between you and your friends on social media or the platform services, and that you all gather regularly in the same location at the same time. If there are enough preferences shared between each profile, then the platform might start showing similar ads to everyone in the group.

Those were some examples, but many more could be produced. And it's not very far-fetched to think that other types of residual data could be used to associate your online persona with a particular set of advertisers. In fact, there is a lot more valuable data in your daily digital habits than you are aware of.

In fact this kind of workflow lends itself well to automation. The content categorization and tagging might still be done by scores of human beings for a pittance, but the process of classifying whether a user will be susceptible to a particular type of ad can be done by utilizing the data trails of thousands of users: by going through the behaviour patterns when browsing and the search queries, but also the information about the user's own identity and interactions on social media. This makes it easy for the ad platform to make associations from users to advertisers without human intervention.

[Diagram: surveillance marketing model]

For the advertisers this state of affairs is rather convenient, since this approach to marketing allows for:

  • Micro-targeting potential customers, or those that should be more receptive to the ad message.
  • Quick identification of ad conversion into sales.
  • Continuous monitoring of ad campaign results and engagement.
  • Identifying new categories that are a better match as targets for campaigns.
  • Quick turn-around, from scheduling campaigns, to contracting targets, to ad placement.

Also, this approach allowed a larger number of players to enter the advertising game, since it's cheaper to select a narrow target for the ad campaign than to place advertisements on the traditional broadcasting media, and it's easier to limit and monitor the budget for the campaign. Although there were situations where competitors or malicious groups auto-clicked on the ads and depleted the available budget of an advertiser before any interested user could be converted into a sale.

This marketing model is what pays for the platforms, pays for social media, and in some ways pays for a lot of the media content available on the internet. But it has some issues associated with it. The first one is scale: platforms and social media sites are larger and control the revenue flow to the media outlets. This has already had consequences for the type of content that is produced in the text media, with shorter articles, less time between postings and more provocative titles. Though there are, here and there, sites with quality content, the increasing noise of all this churned material, made to get the most clicks and ad revenue, drowns out everything that can be relevant.

For users this model can be convenient, because it allows for services without having to pay directly. But this model has allowed for the large-scale dissemination of personal data. And though it might be relatively innocuous for an advertiser that wants to sell you a car, or some new headphones, it's quite another thing if the ad directs you to a scam site, or to a potentially ruinous financial scheme. Also, the same tools that help advertisers track and micro-target users can be used by political parties, or even by intelligence agencies. So, it might be in your best interest to get educated on how to protect your personal data.


Making Insights By Exploring Simple Market Basket Analysis

In this article I will give some examples of simple analysis that can be done if you have a dataset of purchases, and how this can be extrapolated for other uses. Part of the information was taken from the following article: “A Gentle Introduction on Market Basket Analysis — Association Rules”, which is a good introduction to market basket analysis from the point of view of association rules, although I won't use the methods that are demonstrated there. Instead, I will use a shallow and naive approach that, although it will carry limited insights, I think is better for grasping the underlying base concepts and for making you aware of other uses besides retail.

The dataset that I will use can be found here; it is a dataset from an online retailer with transactions going from 01/12/2010 to 09/12/2011. To analyse the data I will use an R script, and I will make heavy use of available libraries, which will make the code very terse. Because of this, some extra explanation might be required of why a function is used and what it does.

Starting with the code

We begin with the following libraries that will be necessary to query the data and make the necessary data transformations. In this case I will use mostly general use libraries and not specialized analysis packages that would do most of the work required to find associations between items.

library(tidyverse)
library(readxl)
library(lubridate)
library(dplyr)
library(pracma)
library(reshape2)
library(tibble)

First thing is to load the dataset and make some initial transformations to help out later. And this part follows the same loading pattern as the article that was linked previously.

retail <- read_excel("Data/Online retail.xlsx")
retail <- retail[complete.cases(retail),]
retail <- retail %>% mutate(Description = as.factor(Description))
retail <- retail %>% mutate(Country = as.factor(Country))
retail$Date <- as.Date(retail$InvoiceDate)
retail$Time <- format(retail$InvoiceDate, "%H:%M:%S")
retail$InvoiceNo <- as.numeric(as.character(retail$InvoiceNo))

The next step will be to decide what data features we will base our analysis on. Again, for didactic purposes this will be a very shallow analysis, so we will only choose as features CustomerID, InvoiceNo and Description, which is the name of the product. We will make the initial set of assumptions that it's not important when the purchase was made, in what order it was made, or where the customer is from. We will only make a static analysis of the selected features; this is a gross simplification, but it will make it easier to introduce the concepts, and we can grow in complexity later.

The initial goal is to make pseudo-predictions, so that when a customer selects a set of products we can have an idea, based on the data, of which products that customer would probably get in the same purchase order. So we start with the following data transformations:

retailUnique <- retail %>% select(InvoiceNo, Description, CustomerID) %>%
  distinct()
retaiCustomerlUnique <- retailUnique %>%
  select(CustomerID, Description) %>%
  distinct()
retailUniqProds <- retaiCustomerlUnique %>% select(Description) %>%
  distinct() %>% arrange(as.numeric(Description))
retaiCustomerlUnique <- retaiCustomerlUnique %>%
  mutate(Product = as.numeric(Description)) %>%
  arrange(CustomerID)

For the first iteration we will sample the whole dataset for invoices that have at least n-1 of the chosen products. This will get a big enough number of invoices without limiting the sample too much. As I said previously, this is a shallow approach and I am not calculating individual probabilities or joint probabilities of particular item sets. For the first experiment the script will randomly choose three products from the list of unique products.

prods <- retailUniqProds[sample(nrow(retailUniqProds), 3),]
invoices <- retailUnique %>% inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>% summarise(n = n()) %>%
  filter(n >= 2)
sampledData <- retailUnique %>% inner_join(invoices)
sampledDataRemoved <- sampledData %>% anti_join(prods)
sampledDataFreqs <- sampledDataRemoved %>% group_by(Description) %>%
  summarise(Freq = n()) %>% arrange(desc(Freq))

head(sampledDataFreqs)

We first need to get our sampled invoices, and then remove any entries with the chosen products from those invoices, so that we can make a frequency count and order the table from most frequent to least.

You can execute this script several times, and you will get a list of product frequencies that appear in invoices where at least two of the selected products are part of the purchase. This frequency is related to how many times the product appears in customer invoices. Sometimes the script will return a list, and sometimes it won't return anything; in that case it means that the selected products haven't got any pairings in the dataset's invoices.

One of the things that comes through is how difficult it is for n randomly selected items to be present in the same purchase. So increasing n will lower the sample of invoices you can work with, and even if the selected products are present in lots of invoices these might not be present together in the same invoice.

As an example we can have the following randomly selected products:

[Image: the three randomly selected products]

This will give the following results once we ran the script:

[Image: resulting product frequency table]

Selecting your own sample

You can also explore other possible item sets by selecting your own list of products, which is shown in the code example below.

prods_chosen <- c("WHITE HANGING HEART T-LIGHT HOLDER", 
  "PEACE SMALL WOOD LETTERS", "BREAD BIN DINER STYLE RED ")
prods <- retailUniqProds[retailUniqProds$Description %in% prods_chosen,]
invoices <- retailUnique %>% inner_join(prods) %>%
  group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n()) %>% filter(n >= 2)
sampledData <- retailUnique %>% inner_join(invoices)
sampledDataRemoved <- sampledData %>% anti_join(prods)
sampledDataFreqs <- sampledDataRemoved %>% group_by(Description) %>%
  summarise(Freq = n()) %>% arrange(desc(Freq))

head(sampledDataFreqs)

This will return the following results:

[Image: resulting product frequency table for the chosen products]

In this particular case we got a higher count for invoices that have at least 2 of the selected products, and as a consequence we can see the six most frequent products that paired with our selected products.

As a test, if you change the filter on 'n' to greater than or equal to 3 in the invoice selection line, you will find that you get an empty result set back. So the probability of having all three items in an invoice, without knowing beforehand the most frequent pairings, is low.

Now, let's profile the customers

We will now change the approach: instead of using all of the invoices, we will profile all customers by the products they bought and rank them by similarity with each other. And instead of using an anonymous customer we will pick a customer from the dataset. We will still make a choice of n items, but this time we will take a different route and pick the sample from the most similar customer profiles that have invoices with n-1 of the products.

We will first generate the customer profiles. This will be a shallow profile, a normalized vector with all the products that the customer purchased, without regard for quantity or invoice occurrences. And as a similarity metric we will compare each customer profile against every other using the cosine similarity index.

dfProfileTemplate <- retailUniqProds %>%
  mutate(Product = as.numeric(Description)) %>%
  select(Product) %>% t() %>% as.data.frame() %>%
  add_column(CustomerID = 0, .before = 1)
dfOrderedCustomers <- retaiCustomerlUnique %>%
  select(CustomerID) %>% distinct() %>%
  arrange(CustomerID)
dfProfileTemplate[2:(nrow(dfOrderedCustomers) + 1), 1] <- 
  dfOrderedCustomers$CustomerID
dfProfileTemplate <- dfProfileTemplate[-1,]
dfProfileTemplate[is.na(dfProfileTemplate)] <- 0

for (i in 1:nrow(dfOrderedCustomers)) {
  xcustomer <- dfOrderedCustomers[i,]$CustomerID[[1]]
  customerProds <- retaiCustomerlUnique %>%
    filter(CustomerID == xcustomer) %>%
    select(Product) %>% arrange(Product)

  for (j in 1:nrow(customerProds)) {
    prod <- customerProds[j,]$Product[[1]] + 1
    dfProfileTemplate[i, prod] <- 1
  }
}

This first part of the code will generate the data frame with the necessary canonical vectors of existing products for each customer. It will have the CustomerID column, as well as all of the 3877 available products listed as columns, each of which can only have 0 or 1 as a value.

In the following code we show two functions for calculating the cosine similarity. The first version is the simple definition of the similarity metric, and the second is a slightly optimized version that receives the vectors already in a pre-digested form. This metric returns a real number from 0 to 1, where 0 means that the two vectors don't match at all and 1 that they are exactly the same.

fnCosineSimilarity <- function(vect1, vect2) {
    similarity <- dot(vect1, vect2) / (sqrt(dot(vect1, vect1) * dot(vect2, vect2)))
    return(similarity)
}

fnCosineSimilarity2 <- function(vect1, vect1Mod, vect2, vect2Mod) {
    similarity <- dot(vect1, vect2) / (sqrt(vect1Mod * vect2Mod))
    return(similarity)
}

numberRows <- nrow(dfOrderedCustomers)
nc <- ncol(dfProfileTemplate)
customers <- dfOrderedCustomers$CustomerID;
lstVectors <- list()
lstVectorsMod <- list()
vectCustomerID1 <- c()
vectCustomerID2 <- c()
vectSimilarityIndex <- c()

for (i in 1:nrow(dfOrderedCustomers)) {
    lstVectors[[i]] <- as.numeric(dfProfileTemplate[i, 2:nc])
    lstVectorsMod[[i]] <- dot(lstVectors[[i]], lstVectors[[i]])
}

for (i in 1:(numberRows - 1)) { # stop before the last customer so the inner loop index stays in bounds
    cust1 <- customers[i]
    perc <- i / numberRows

    for (j in (i + 1):numberRows) {
        cust2 <- customers[j]
        simil <- fnCosineSimilarity2(lstVectors[[i]], lstVectorsMod[[i]], lstVectors[[j]], lstVectorsMod[[j]])

        if (simil > 0) {
            vectCustomerID1 <- append(vectCustomerID1, cust1)
            vectCustomerID2 <- append(vectCustomerID2, cust2)
            vectSimilarityIndex <- append(vectSimilarityIndex, simil)
        }
    }
}

dfProfileSimilarity <- data.frame(CustomerID1 = vectCustomerID1, CustomerID2 = vectCustomerID2, SimilarityIndex = vectSimilarityIndex)

By using the previous code we will create a data frame that pairs each customer with every other customer and marks them with a similarity index based on their product profiles. This won't be the most efficient algorithm implementation, so expect it to take some time to get all the customers' similarity scores.

For the next step, we need to pick a customer number, the similarity index threshold that will filter the number of customer invoices, and the n products that we want to check. In this case I chose customer 15313, for convenience reasons, since this customer has a lot of other customers matching its profile. And I chose a similarity threshold of 0.1, since I don't want to leave out too many invoices for the comparison process and I still want to get some frequencies back.

dfCustomersBySimilarity <- dfProfileSimilarity %>%
  filter(SimilarityIndex >= 0.1 & (CustomerID1 == 15313 | CustomerID2 == 15313)) %>%
  mutate(CustomerID = ifelse(CustomerID1 == 15313, CustomerID2, CustomerID1))
dfJustCustomerId <- dfCustomersBySimilarity %>% select(CustomerID)

prods_chosen <- c("STRAWBERRY CHARLOTTE BAG", 
  "JUMBO BAG BAROQUE BLACK WHITE", "LUNCH BAG RED RETROSPOT")
prods <- retailUniqProds[retailUniqProds$Description %in% prods_chosen,]
invoices <- retailUnique %>% inner_join(dfJustCustomerId) %>%
  inner_join(prods) %>% group_by(InvoiceNo, CustomerID) %>%
  summarise(n = n()) %>% filter(n >= 3)
sampledData <- retailUnique %>% inner_join(invoices)
sampledDataRemoved <- sampledData %>% anti_join(prods)
sampledDataFreqs <- sampledDataRemoved %>% group_by(Description) %>%
  summarise(Freq = n()) %>% arrange(desc(Freq))

head(sampledDataFreqs)

Once we execute the previous code we will get the following results:

[Image: resulting product frequency table for the similar customer group]

As you can see, now we were able to get all of the invoices with the three products, although the frequencies are rather low. But at the same time we are matching invoices from customers that have a greater similarity to our target customer.

Final Thoughts

The methods that are shown in this article won't automatically find which products are most frequently paired; the idea is to enable exploration of the base concepts that lead down that path. If what you want is to get the product rule sets, then I would recommend reading the article that is linked at the start of this post.

In the first part of this article, I am trying to show how you can explore relations between products and what patterns can appear by mining the data of several online shoppers. This has a big obstacle from the start: for any given invoice, the conditional probability of most combinations of 'n' random products being present is rather low. That is the reason that, instead of checking for all selected items, the check is whether n-1 of the selected items are present in the invoice.

In the second part, the idea is to explore how profiling customers can give more relevant results. Since birds of a feather tend to flock together, it makes sense to explore the idea of having product rule sets for customers that share similar tastes. Of course, there will be issues in getting a sample size that allows for meaningful comparisons, and the same issues of conditional probability are bound to happen in this case as well. Although for customers that have a large degree of shared patterns with others, this does tilt the odds in our favour.

In both cases in which we explored this dataset, we didn't look at any of the time-related data features. This was a simplification; in reality the time dimension is quite important, since products might have a seasonal component, being purchased more often at Christmas, for Valentine's Day, or at the start of the summer holidays. Also, products might be discontinued, or appear and disappear depending on fads and other consumer trends. The other aspect left out was the regional and geographic data features: customers from different countries might have slightly different shopping patterns.

The techniques used in profiling users are also useful in finding ways to recommend products that match the previous patterns of similar customers. This might have the unintended side effect of increasing the convergence between customers in terms of purchasing patterns, making them even more similar, which might not be the intended result.

 


Why Methodologies Fail to Bring Home the Bacon

In my career I have seen methodologies fall in and out of favour, with each new wave promising to solve the problems that plagued the previous methodology du jour. Don't get me wrong, I am not against using a methodology; whatever works for you, I am fine with it, just as long as you are getting reliable results and everyone is happy with it. What I find problematic is that we move from one set of complaints to another: we moved the goal posts, but we still seem to be having problems.

In my view, the problem is larger in scope, and has to do with issues related to the relative novelty of software engineering. This is a field that only took off after the Second World War, and it is still in its infancy in terms of practices and established traditions, unlike other engineering disciplines that go back at least to the Age of Enlightenment and in some cases have foundational books and documents that are much older. This means that some state of flux is to be expected in this field until we arrive at commonly agreed rules that are proven to work reliably, and until we know in which contexts they apply.

The problems that methodologies try to address are complicated in nature, and software by its own nature doesn’t make them any easier…

Issues With Software Development

Unlike physical things like a building, a bridge, a toaster or a book, software doesn't have a physical existence that can be touched or seen in its totality. We are able to see its artefacts: screens, commands, files, physical media (disks, flash drives, etc.), but we can't fully verify its completeness. In this sense, software is a functional abstraction that implements a set of expectations, and these can only be verified through its usage.

Modern computing languages allow a large degree of freedom in how to implement concepts, since these aren't constrained by physical limits, materials or manufacturing processes. For software, the limits and constraints are usually the available computational power, storage and the ability to model solutions with the available tools. This means that for any set of similar requirements we can have a large degree of variability in implementation, depending on the languages or libraries/frameworks that are fashionable at any given time. There is also a large degree of variability due to human factors, like personal preferences, negative biases toward particular technologies, considerations about learning curves and the odd lack of competence here and there.

Because of this large degree of variability in implementation, good design is at a premium, yet traditionally computer science has stressed the teaching of algorithms, data structures and other low-level constructs. Though these are useful when developers are doing foundational work on operating systems, databases, device drivers and the sort, most software is built around businesses or business applications, or products implemented using higher-level or scripting languages. Another issue is that most people are actually bad at design and tend to have little knowledge of, and exposure to, different exemplars of design solutions. That usually results in attempts to force design solutions onto problems they weren't meant to solve, which is like trying to build a rocket using propellers and piston engines. Bad design choices can also have disastrous implications when new or ancillary requirements appear, making development more onerous and time-consuming or requiring a start from scratch.

When it comes to implementation, the immaterial nature of software makes it problematic to verify its state of completeness while it is being developed. Most software developers strive to develop their code as close to the requirements as they can understand them, but their understanding of the problem might be incomplete. There might also be time constraints that limit the amount of analysis done, which can impact the quality and fitness of the code. In these cases, a developer might also have an incentive to feel that the work is complete and only perform superficial tests, and this is more probable when it is something he/she doesn't find pleasant or doesn't understand well. As such, the lack of clarity in what constitutes completion, given an understanding of a requirement, is a constant point of contention.

From the project management standpoint, the problem of verifying the state of completeness is even larger, since in most cases managers will not have the technical expertise to evaluate the source code. Managers will often press for control over deadlines and budget, and this creates a conflict of interest between developers and project managers. Developers that are being pressured to finish their tasks to meet deadlines will have an incentive to say that their work is complete and worry later about defects, while managers that are under pressure to keep the budget and deadlines under control might be less willing to allow the time necessary for the project requirements to be met. This coordination problem, derived from the uncertainties around requirement completion, is one of the main drivers for control processes and the adoption of methodologies.

The employment of Quality Assurance Analysts / Testers to verify completion is not without issues: the same problem of understanding the requirements applies, and building test cases that cover them is no assurance that the software will actually be complete. The requirements might have omissions or hidden assumptions that lead to conflict between testers and developers, and that might not even be evident to the business analyst or to the people who wrote the requirements. Also, some cases might require testing a large degree of variation, which is often expensive and time-consuming to do manually. And even when test automation is available, getting maximum coverage is highly dependent on the person modelling the test case generation.

A Very Short History of Methodologies in Software Development

Initially, attempts to devise software development methodologies centred around structured programming, which was the foundation of many of the things that we take for granted today. The focus was on developing code block concepts that could compartmentalize functionality and produce source code that was easy to read. This allowed for concepts like flow diagrams to model programs, and from here things would get increasingly more complex.

The development of subsequent methodologies paralleled the processes that were common in big engineering organizations, where projects or products had reasonably long development cycles from concept to market. The computational tools available in the 60s and 70s didn't allow for fast development processes, and in many ways the waterfall process is a product of a time when software developers didn't have interactive debuggers and were forced to test their programs by batch runs in time-shared environments.

As operating systems and computers allowed for more interactive IO (Input / Output), there was an increasing push for more iterative and incremental methodologies that allowed for quicker release cycles. This, in a sense, was a measure of how computing technologies were becoming more ubiquitous: big organizations could afford long development cycles, but smaller organizations didn't have the same level of tolerance and resources.

The increasing availability of computers through the 80s and 90s allowed even more organizations to start developing software, this time with ever more powerful editors and development environments. Some started copying some version of waterfall, others practised a more ad hoc or iterative process. Again, big organizations could afford the costs associated with using waterfall, RUP (Rational Unified Process) or other methodologies common at the time. But one thing was sure, the drive for faster development cycles was a persistent trend.

One thing was common throughout this whole time: rates of success for software projects were not encouraging. Large projects using waterfall methods could get bogged down between steps, and change management was a large, painful process that could scuttle the project. This meant missed deadlines, cost overruns and other types of losses, depending on the type and function of the project. But even incremental or ad hoc methods didn't fare much better, and there was a generalized dissatisfaction with the state of software development.

In a sense this was the result of the successes and the large strides made up to the 90s and early 2000s: the tools were much better and the available infrastructure was a big improvement over the days of time-share mainframes. Databases, queueing middleware, operating systems, etc., there was a rich ecosystem of platforms and tools, but the development of business applications had moved the goal posts. To stay in the same place everyone had to run faster, which meant developers had to take into account more functionalities and concerns. This trend continues to this day, and doesn't seem to be slowing down.

The agile manifesto was a rallying cry against waterfall, especially its large overheads in documentation and analysis, its low flexibility to change and its long cycles. Instead it proposed methods that appeared in the late 90s, like Scrum, XP and others. At its birth it was mostly a hodgepodge of borrowed methods, but it grew and gained momentum. It kept the habit of taking recent ideas and calling them agile, and nowadays it kinda seems like a large buffet of methods and processes.

Agile is the new waterfall: it has become the dominant methodology buzzword of the day. It allows a lot of flexibility in what is called agile, and this loose adherence to a range of integrated methods has been one of the key factors in its growth. That, and the sprint model that limits the cycles of development to a fixed number of days. If you are doing sprints, then it seems you are doing agile…

Issues With Any Methodology

Adherence to a methodology is not a sure-fire way to get better results; there are many aspects that might lead to a less effective adoption. There might also be business and regulatory considerations that conflict with adopting a particular methodology without keeping aspects of the previous one. Here are some for your reading pleasure…

Methodology Theater

Methodology theater is when an organization decides to buy into a methodology on a superficial level. It implements the parts that don't require any organizational change: arranging development cycles around sprints, allowing or requiring that some key members are certified, using buzzwords in documents and in the company's official communication. But overall the development process hasn't changed, the responsibilities within the team are unchanged and the previous status quo is kept.

The other issue with methodology theater is the increase in requirements for moving up the ladder; it is not unusual to have some positions limited to people who have training or are certified in a particular methodology.

No True Scotsman Fallacy

This is the common situation where the failure of a project that adopted a methodology or process is treated as if the methodology wasn't really implemented. When people don't want to believe that the methodology is partly or wholly a cause of the failure, they double down and try to enforce stricter adherence to the process. Claims like “not agile enough” are a possible indication that deeper problems are at the root, and methodology adherence by itself might not solve them.

Having the Cake and Eating the Cake

This is when companies set up agile methods but keep waterfall requirements, which results in scrumfall, the worst of both worlds. Because of this, developers are saddled with the same documentation requirements, but with less time available in the development cycle and worse requirement specs from the stakeholders. Here, organizations are attempting to cut development times by constraining developers while letting the business off the hook in terms of specifying in detail what they need. At the same time they might be legally bound by regulations to have documentation tracking the whole process.

Cheating on the Learning Curve

This is when an organization thinks that the implementation of a methodology will magically solve the issues of having people with the necessary training and knowledge, without having the proper organizational knowledge base to successfully finish projects. No methodology can replace the accumulated organizational experience of the many people who work within a company; these informal ties that bind the organization into a functional organism are essential. Trying to skip or cut corners on the learning curve will leave outcomes to chance.

A Proper Way to Sort it All…

In my view the proper way to check whether a methodology works, and in what context it is better suited, is by science: statistics, proper experiment modelling, surveys, and in the best of worlds it would be excellent if it could be verified by double-blind testing. Unfortunately that last option is not possible… I propose three ways to sort out which methods work better.

The Undergraduate Challenge

Have several teams of undergraduates distributed around several university campuses; projects would be randomly assigned, as would the methodology each team would be applying. Each team would have a task master monitoring it, and each team would work 8 hours a day in the same room until project completion or until the available time was spent.

The projects would fall into different categories of problems, but would be similar to typical business applications rather than academic research projects. There would be variations within the same category to avoid cheating or collusion, and some maintenance cycles would be added so that the teams would be forced to live with the consequences of their design decisions.

Evaluation would be done by analysing the results: how many teams completed the challenge, how many development cycles were completed, time to completion, defect counts, defect resolution counts, and other metrics. These would be cross-tabulated by methodology and project category to check for indicators of the level of fitness of each methodology.
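
For illustration only, here is a minimal dplyr sketch of that kind of cross-tabulation; the results data frame and its column names are hypothetical stand-ins for whatever the experiment would actually record.

library(dplyr)

# Assumes a hypothetical 'results' data frame, one row per team, with columns
# such as methodology, category, completed (TRUE/FALSE), cycles and defect_count
results %>%
  group_by(methodology, category) %>%
  summarise(
    teams           = n(),
    completion_rate = mean(completed),
    avg_cycles      = mean(cycles),
    avg_defects     = mean(defect_count)
  )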

Problems with this approach:

  • Undergraduates mostly lack work experience and are still too used to developing throwaway code for class projects.
  • The way the incentives are designed for the experiment might not match the reality of work in companies.
  • The sample projects might lack the sort of “experiences” that cause friction and generate what could be called an implementation fog.

The Mass Survey

This would be a large-scale survey that would track several organizations over several years, checking on several project teams and following which methodologies they are using, their rates of success, and how those rates vary over time and when teams change methodology. It would also have to control for team sizes, relative levels of experience within the teams, types of projects, and the technologies and platforms used. The data gathered would then be analysed to check whether there were any differences in outcomes due to the use of a methodology, or whether other factors had a larger weight.

Problems with this approach:

  • Gathering data would be tricky; companies don't usually like this kind of scrutiny, and voluntary participation isn't a sure thing.
  • Survey modelling would be tricky, since we could expect a large degree of variability in each company and over time.
  • There might be confounding factors that skew the results, like hidden or unacknowledged innovations (teams or individuals using methods or practices outside of the methodology that confer better than expected outcomes).

The Self Reported Survey

This would allow participant companies to answer a set of questionnaires over time, and the results would then be analysed to verify any trend. This is the easier option, and often the most practised, done by sending surveys to IT managers or developers.

Problems with this approach:

  • Answers are often based on a subjective reading of events.
  • The survey is superficial and doesn’t give enough context.

In Conclusion…

It is not my goal to promote a methodology or discourage you from using one; it is you who needs to decide whether what you are using is working for you. What I tried to do is identify issues that lead to suboptimal outcomes. And, by acknowledging these, there might be a way to find a better path to successful and satisfying software development.

CodeProject, Project Management, Rambling, Software

How To Turn Around a Failing Software Project

In this article I will discuss some scenarios and situations that involve turning a project around and saving it from the jaws of failure. I will also discuss those cases where the chances are slim or non-existent, along with some of the possible solutions and pitfalls. A word of warning: this won't be about heart-warming stories, and people probably won't be singing kumbaya afterwards.

Failure and success can be framed in many ways for a project, depending on whether you are in a product company, a services company or a non-tech company developing internal projects. In the first case, success might be defined in terms of goals achieved (business and/or technical) and meeting deadlines; in a services company it is successful delivery on time and on spec for a particular customer or customers; and in the latter it will be successful delivery on time, satisfying the requirements of the internal stakeholders. I know… This is all very fuzzy in terms of defining success.

While success is not guaranteed, failure can rear its ugly head early on: deadlines are missed, budgets burn like a match and technical flaws show themselves in high bug counts. As problems start to accumulate people start to get anxious; reputations are at stake, and sometimes the company's future is at stake. As the project spirals towards failure, changes are needed to turn the boat around, with a new captain on board.

Naval metaphors aside, being a project manager under duress is very much like being a soccer team manager hired to avoid relegation to a minor league (yay, sports metaphor). Some managers specialize in turnarounds, and it is not a pleasant job. Usually it comes with consequences for the team members, with some players getting health issues for the rest of their lives in the process.

But project management doesn't operate in a competitive sports tournament landscape: projects aren't a sequence of games that need to be won against other teams, intermediate victories can be invisible to stakeholders, and defining success is not as easy as checking a score.

Best Case Scenarios

Failure presents itself in many forms. Sometimes the reasons are external: the death of a key person, an unexpected downturn in the economy, or a discontinued middleware product can severely impact the chances of success. Many are internal: bad choices when selecting managers and team members, company culture and other human-related issues. From my experience, technical issues are most of the time more tractable than human issues, especially if all that is required is more time to get around the learning curve. Human issues, on the other hand, can spiral into a lot of drama and soap opera re-enactments, with no extra dramatic music though.

When you are thrown into the drama of a failing project, the best situation you can have is:

  • Upper management is committed to the success of the project (a bit of desperation is nice, but not too much).
  • The previous project manager has left the company (more on that later).
  • The team members aren’t all duds and have enough skills to get the work done.
  • The budget isn’t running on fumes.
  • The company’s culture isn’t the cloak and dagger type.

The first thing you will need to do is understand what the project is: its goals and scope, timetables, technical architecture and technology dependencies. Afterwards, meet the team, understand what role each one has, read their interactions and their demeanour, and don't make any judgements based on first impressions. Discuss what was done by the team so far, if possible in one-on-one meetings, and check whether the information from the team meetings and the individual interactions is consistent. And check whether the information the team is giving is consistent with the feedback from upper management.

If possible, before interacting with the customer or third-party stakeholders, get an understanding of whether the previous project manager lied, mischaracterized, deceived or in any way told tall tales that have framed their minds about the state of the project. If the answer is yes, then tread carefully… People often get very emotional when the facts are told, so start gathering information about these third parties and don't do any reveals in your first meetings with them.

In the first meetings with your peers and direct superiors, discuss small on-boarding issues and check their responses, and also evaluate how they act on small requests for information connected to the project and the company. In those meetings check how everyone behaves, and determine whether there are managers that are prone to snap or bully. Check how other managers react when they feel under pressure. This will be important to evaluate to what degree the systems of rewards and sanctions are applied within the company, and what type of personality is usually hired for managerial roles.

After the initial stages of getting to know the team, peers and third parties, make an initial evaluation and determine which ones can help you, which are neutral and which can hinder you. This is not a static evaluation but a continuous loop until the project is finished. So, when you implement an initiative, you will monitor the feedback and responses, re-evaluate your position, and make corrective changes for the next iteration.

One of the items that requires quick evaluation is the overall composition of the team, identifying poor performers and team members that generate entropy. Team members that don't contribute much code and have a high rate of defects for the amount of code they contribute, or that break the builds frequently and don't listen to other team members' feedback, should be removed from the team as quickly and painlessly as possible. Check for internal replacements if possible, and prioritize developers with a good track record that know what they are doing. If you need to hire or contract, prioritize people you already know and have worked with before who meet the previous conditions – keep a habit of staying in touch with developers that performed well in the past and that can still help you or can reference people you can work with.

Keep an eye out for armchair trainers that keep second-guessing your actions; these can be especially problematic before you can show results, and can frame the attitudes of others and of upper management.

On the project side you will need to tightly manage scope; it is of the utmost importance to identify and focus on the core deliverables and avoid being distracted by side-features. Renegotiate the scope with the stakeholders and have them agree to a primary set of deliverables. This is when it is important to know in advance whether the previous manager over-promised or lied about the status of the project. Scope needs to be negotiated at every iteration, and it needs to fit the capacity and the skills of the team; over-promising or over-committing will not get your project on time and on spec.

To evaluate the status of the project you will need to meet regularly with the team members, either in stand-ups, sprint meetings or meetings to discuss project issues. Meet individually, if possible, to check on critical developments and evaluate what work is still required and whether there are any blockers. Insist on a continuous release system with at least a set of sanity tests to catch problems early. Verify the feedback from the QA team, check how the defect count is evolving and verify whether individual team members require help in resolving any defect. Prioritize stability over features: team members need to fix defects before starting new work.

On the QA side of things prefer a data-driven approach: have a process for comparing the results of the software the team is developing against an older version of the product, against client or company data, or against business scenarios developed by the QA team or Business Analysts. Also use methods to generate noisy data to verify how the software handles boundary conditions and errors. This will provide a benchmark of progress that will help in gaining the stakeholders' trust.

Be alert for software dependencies that start breaking your builds; track their versions and make a judgement call with the team about when to revert to an older version and when to fix the code that no longer works with the new version of the library. Schedule this work around deliveries and avoid doing it when pressed by a deadline. Also, avoid the temptation of switching software libraries for capricious reasons; do it only if there are clear advantages and the schedule allows for rework.

Automate as much as you can, and don't fall for the trap that says there is no time for automation. Start automating at the very beginning by identifying a suitable candidate to do this work, and identify the work that has the most repeatable patterns. By doing this you can create enough slack to mitigate any need for extra rework due to unforeseen events, a missed requirement or clearing defects.

Make clear and frequent reports on the milestones that were reached, but beware of sounding too optimistic. That might create an incentive for upper management to start cutting resources or budget, or to shorten deadlines before critical work is finished. It might also create exaggerated expectations that could be dashed by unknown unknowns. All of these represent extra unnecessary stress and have the potential of becoming a reputation risk for yourself.

Find ways to reward the team when a difficult milestone is reached; it is an important signal that their effort is appreciated. When communicating with them, be straightforward and clear. If there is a risk of the project being cancelled due to a change of mind of a stakeholder, let them know. Address these risks with the team, as well as the actions being taken to mitigate them.

In this scenario a successful turnaround is still a lot of work, and you are still dependent on managing the relationship with third-party stakeholders. These need to be reassured by the quality of the deliverables, and they need to feel that the scope of the project is bringing them value. If they feel otherwise, they might be tempted to cancel their participation or cancel the project.

This best case scenario gives a template that will be expanded in the next, much less optimistic, cases.

Project Gone Wrong II

In this situation you will have your work cut out for you. There is an increased risk of things going wrong, mostly due to human issues: the project is already showing signs of stress, and there are extra factors that can increase risk, like:

  • Upper management requires quick results but isn’t committing much into new resources or support.
  • The team is unbalanced, with not enough highly skilled developers.
  • The budget is conditional on reaching goals, so cancellation is a constant spectre.
  • The previous manager is still working in the organization.

In this situation you will need to quickly determine which developers are good enough to continue the work, and identify those you need to try to replace as quickly as you can. You will also need to verify whether the team lead is part of the problem and whether you will need to find a new one. Here, being an insider actually pays off: you probably already know the people in the team and who in the company can be allocated, so that you are able to restructure the team for the project's needs. An outsider will have a tough time quickly telling apart those that can get results from those that sit idle browsing Facebook while faking being busy.

The other question is understanding why the previous manager left the project but still kept his or her job in the company; the reasons why will indicate whether that will become a problem for you or not:

  • Sickness or burnout, which could indicate that he or she might have been under a lot of pressure.
  • Calling in sick, which can be used to avoid career-destroying events and put the blame on someone else.
  • The previous manager has a powerful protector, and to avoid being tarnished by failure a new placement was found. In this case be very careful.
  • Just gave up on the project and negotiated another placement.

The previous manager can also act as a spoiler if your performance turns out better than his/hers; to avoid looking bad, he or she might feel the need to frame other managers' opinion of you. So it is good to get an idea of his/her personality to estimate this risk, and to discreetly get some references from other people who worked with him/her.

Find out whether other managers were meddling or trying to change the direction of the project to fit their needs. Check what their influence is and their standing in the pecking order. Check whether they can be co-opted, or persuaded to stop and stay on the sidelines. Otherwise, you might have to use guile to put them out of action. A warning: using such means will get you enemies that will not hesitate to punish you when the time comes.

Check with stakeholders to find out their current level of commitment and their general mood about the project. Check also whether there are indications that they are contacting, or already in a working relationship with, other companies or teams to do similar work.

Address any issues of inter-departmental conflicts of interest, as in the case where QA is a separate team. Check to what extent the head of QA has an interest in getting a bigger share of the available budget, and whether the QA team applies tactics that maximize the amount of testing time or resources. This could be, for example, finding critical defects a day before the release, on every release. Always check for the incentives that can reward this kind of behaviour and how it can damage the trust between the development and QA teams.

After identifying the sources of internal and external political risk, devise a strategy to counter and mitigate them, so that the development team is insulated from these issues and has the conditions to continue its work in relative peace.

Work to keep an effective control loop on the project status with the development team, track issues early on and don't fall for the trap of magical thinking. Devise clever methods to extract information from the available tracking tools so that you can automate part of your process.

Be on the lookout for motivational issues: team members becoming unresponsive, an undeclared conflict going on between two team members or, worse, the team splitting into factions. Work to resolve these issues as quickly as possible and don't let them fester, because your success will depend on getting them to work as a cohesive team.

In these scenarios, the challenge will be navigating the internal political situation in the office, keeping external stakeholders committed, and resolving any issues within the project team, all while holding an unfavourable hand at the start. And probably with a lot of bumpy roads to cross…

Hopeless Cases

Not all projects are salvageable, and there are times you should consider forgoing such “opportunities” for development. You will need to be able to identify when these situations are presented to you, because some people might have an interest in passing the buck to you or someone else. To avoid being fooled you will need to read the signs, like:

  • No commitment by upper-management and/or external stakeholders in the project.
  • No budget, and there are only vague promises of new budget allocations.
  • The team is filled with flunkies.
  • The company is a hotbed of toxic office politics.

For example, you might be told that you are replacing another project manager on a much-hyped project, and that this manager will be promoted to a new position. After some due diligence on your part (or because you already know), you find that the project was nothing more than a placeholder for the previous PM, who had already been fast-tracked for a promotion but needed some sort of “fluff” project to justify it. The project was never meant to produce results, but now it needs to be shut down without tarnishing the reputation of the previous PM. In this case, you are the designated fall guy, and there is no upside for you. Even if you are capable of showing results, upper management might actively undermine the project to protect the previous PM. Try to find a good excuse not to accept these situations, or get a new job.

Avoid getting yourself involved in projects where everyone expects a Hail Mary moment to save the day. If it reeks of desperation, run!

If the development team is filled with friends of the previous PM, or of some other manager, who have a carefree attitude about work, and you can only count on those in the team and can't hire new people to replace them, don't lose your hair: find another placement. Avoid working with people whose work or commitment you don't trust.

In Conclusion

Turning a project around and safely navigating it to success is an art. As you could see, I didn't talk much about technical aspects, but more about the human factor. I believe that most projects that fail get into that situation through human intervention. The technical side is only as daunting as the time necessary to master the learning curve and the amount of resources required to reach completion, unless you are attempting the impossible given the technical means and concepts of the time.

Projects usually fail due to failure in scope management, failure in balancing the project team with skilled people, failure in properly managing relationships between the team and stakeholders, not having proper levels of commitment from the higher-ups, and failure to manage office politics.

To be successful, you will need to find your own style for dealing with each one of these issues. A good piece of advice: find and retain your allies. You will need them, because no man or woman can do it all by themselves.
