Learn from your data: Perform association analysis with Python and pandas

Identify relationships between items in your transaction data—and make business decisions based on those relationships.

By Yuli Vasiliev | November 2021


Your organization has tons of transaction data—and creates more data every day. There’s invaluable knowledge in those database records! Association analysis lets your organization benefit from this data, extracting useful information from it. This article explains how to take advantage of this technique to identify relationships between items in transaction data and turn them into actionable insights.

After a brief discussion on the basics of association analysis, you’ll see some examples of applying association analysis techniques to a sample set of transaction data in Python and in Oracle Database for comparison. This article assumes some knowledge of Python and Oracle Database, such as being able to install libraries with pip.

Introduction to association analysis

Association analysis discovers the probability of the co-occurrence of items in a collection. The strength of associations between co-occurring items is measured by metrics known as association rules. In essence, association rules are IF-THEN statements that express the probability of relationships between items. An association rule is denoted as X -> Y, where X is the IF component of the rule, called the antecedent, and Y is the THEN component, called the consequent. Or, to put it more plainly, association analysis tells you how likely it is that, if X occurs in a record in the dataset, Y will show up in the same record.

In IF-THEN notation, the above rule might be denoted as IF X THEN Y.

Association rules are often used to analyze sales transaction data, helping retailers find relationships between the products that people frequently buy together. For example, the rule IF {bread, butter} THEN {cheese}, which might be found in the transaction data of a grocery store, indicates that when shoppers buy bread and butter in the same shopping trip, they are likely to also pick up cheese. This knowledge might drive coupons and special pricing, product placement, and inventory management. When the insights are not intuitively obvious, they can provide not only a way to increase revenue but also a competitive advantage.

There are several metrics for evaluating the strength of a rule. The most common association metrics are the following:

  • Support is the ratio of transactions that include a certain item (or an itemset) to the total number of transactions.
  • Confidence is the ratio of transactions in which the antecedent and the consequent occur together to the number of transactions in which the antecedent occurs.
  • Lift is the ratio of the joint occurrence of the antecedent and the consequent to the product of the probabilities of occurrences of the antecedent and the consequent if they were independent.

First, you can use the support metric to determine frequent itemsets in your transaction set (a set of items that occurs in many transactions is considered frequent). The other metrics can then evaluate association rules for the frequent itemsets. The Apriori algorithm, developed by Rakesh Agrawal and Ramakrishnan Srikant in the 1990s, is designed for exactly that: determining frequent itemsets and evaluating association rules on them.
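As a minimal illustration of these two steps, here is a naive brute-force sketch on a toy basket list. The item names and the 0.5 support threshold are invented for illustration; this is not an optimized Apriori implementation.

```python
from itertools import combinations

# Toy baskets; names and the 0.5 support threshold are invented for illustration.
baskets = [{'bread', 'butter'}, {'bread', 'butter', 'cheese'},
           {'bread', 'cheese'}, {'milk'}]
n = len(baskets)

def support(itemset):
    # Fraction of baskets that contain every item in the itemset.
    return sum(itemset <= b for b in baskets) / n

# Step 1: keep the itemsets (here, sizes 1 and 2) that meet the threshold.
items = sorted(set().union(*baskets))
frequent = [frozenset(c) for size in (1, 2)
            for c in combinations(items, size)
            if support(set(c)) >= 0.5]

# Step 2: score a candidate rule X -> Y over a frequent pair.
confidence = support({'bread', 'butter'}) / support({'bread'})
```

A real implementation prunes the candidate itemsets level by level instead of enumerating them all, which is what makes Apriori practical on large datasets.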

Uncovering association patterns with the Apriori algorithm

Using the Apriori algorithm consists of performing the following two steps:

  • Determine frequent itemsets in a transaction dataset.
  • Generate association rules for those frequent itemsets.
Below is a simple example that illustrates how this algorithm can be applied in practice. The example uses a sample transaction dataset implemented as a Python list of lists, where each nested list represents the set of products found in a single transaction.

    transactions = [
     ['cabbage', 'cucumber'], ['cabbage', 'eggplant', 'cucumber'], ['carrot', 'potato', 'pepper'],
     ['carrot', 'pepper'], ['carrot', 'tomato'], ['spinach', 'eggplant', 'pumpkin'], ['carrot', 'tomato', 'corn'],
     ['lettuce', 'leek'], ['cabbage', 'cucumber', 'spinach'], ['corn', 'broccoli', 'tomato'],
     ['radishes', 'potato'], ['carrot', 'potato'], ['radishes', 'beets', 'potato'], ['cabbage', 'beans'],
     ['carrot', 'potato', 'pepper'], ['carrot', 'cucumber', 'pepper'], ['artichoke', 'cucumber'],
     ['cabbage', 'cucumber'], ['carrot', 'beets'], ['onion', 'garlic']
    ]

    This sample dataset consists of 20 transactions, each of which contains two or three items. (In a real transaction dataset, there would likely be millions of transactions, and each might be considerably more complex; think of a typical store receipt covering a week of grocery shopping for a family.)

    If you look through the transactions, you may notice that some products appear more often than others and some of them tend to appear together. For example, there are five transactions with cabbage and six transactions with cucumber, among which there are four that contain both cabbage and cucumber. Let’s now look at how to calculate the main metrics for the cabbage -> cucumber association rule.

    As mentioned, the support of an item is the ratio of transactions that include this item to the total number of transactions. So, the support for cabbage in the example being discussed can be calculated as follows:

    support(cabbage) = cabbage/total -> 5/20 -> 0.25

    For itemsets, the support is the ratio of transactions in which an itemset occurs to the total number of transactions. Thus, you could use the following formula to calculate the support for the cabbage & cucumber itemset defined in the example:

    support(cabbage & cucumber) = (cabbage & cucumber)/total -> 4/20 -> 0.2

    A support metric falls in the range of 0 to 1. The Apriori algorithm uses a user-specified support threshold to determine if an itemset can be considered frequent. For example, if you set the support threshold to 0.1, any set of items that appear together in at least 10% of all the transactions will be considered frequent.

    Once the set of frequent itemsets has been determined, you can evaluate association rules on them. Confidence and lift are the primary metrics. Turning back to the example, confidence can be interpreted as the likelihood of purchasing cabbage along with cucumber. So, the confidence for the cabbage -> cucumber association rule can be calculated as follows:

    confidence(cabbage -> cucumber) = (cabbage & cucumber)/cabbage -> 4/5 -> 0.8

    Confidence can be in the range of 0 to 1. This metric is not symmetric. For example, the confidence for cabbage -> cucumber differs from the confidence for cucumber -> cabbage in this example.

    confidence(cucumber -> cabbage) = (cabbage & cucumber)/cucumber -> 4/6 -> 0.66

    Confidence is a measure of performance for an association rule, indicating how often the consequent follows the antecedent.

    As mentioned, lift is another important metric for evaluating association rules. The lift of association rule cabbage -> cucumber is the ratio of the observed support for cabbage -> cucumber to that expected if cabbage and cucumber were independent of each other. This can be calculated as follows:

    lift(cabbage -> cucumber) = support(cabbage & cucumber)/(support(cabbage)*support(cucumber)) -> 0.2/(0.25*0.3) -> 2.66

    The range of possible values for lift is [0, ∞). A lift ratio larger than 1 indicates that the relationship between the antecedent and the consequent referenced in the rule is stronger than would be expected if they were independent.
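These hand calculations can be verified in plain Python. The following sketch repeats the transactions list defined above so that it is self-contained:

```python
# Verify the support, confidence, and lift figures derived above.
transactions = [
    ['cabbage', 'cucumber'], ['cabbage', 'eggplant', 'cucumber'], ['carrot', 'potato', 'pepper'],
    ['carrot', 'pepper'], ['carrot', 'tomato'], ['spinach', 'eggplant', 'pumpkin'], ['carrot', 'tomato', 'corn'],
    ['lettuce', 'leek'], ['cabbage', 'cucumber', 'spinach'], ['corn', 'broccoli', 'tomato'],
    ['radishes', 'potato'], ['carrot', 'potato'], ['radishes', 'beets', 'potato'], ['cabbage', 'beans'],
    ['carrot', 'potato', 'pepper'], ['carrot', 'cucumber', 'pepper'], ['artichoke', 'cucumber'],
    ['cabbage', 'cucumber'], ['carrot', 'beets'], ['onion', 'garlic'],
]
n = len(transactions)

def support(*items):
    # Fraction of transactions containing all of the given items.
    return sum(all(i in t for i in items) for t in transactions) / n

confidence = support('cabbage', 'cucumber') / support('cabbage')
lift = support('cabbage', 'cucumber') / (support('cabbage') * support('cucumber'))
```

Running this reproduces the numbers above: support of 0.25 and 0.3 for cabbage and cucumber, 0.2 for the pair, confidence of 0.8, and a lift of about 2.67.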

    Processing a transaction dataset with Apriori

    A set of association rules can be automatically generated against your transaction data. In Python, this can be done with libraries such as Mlxtend, which provides an implementation of the Apriori algorithm. You can install Mlxtend with pip, as follows:

    pip install mlxtend

    After that, you can apply the Apriori algorithm to the transaction dataset, which was defined as a Python list of lists earlier. The first step is to transform the dataset into a format suitable for further processing. When using Mlxtend, you need to convert the dataset into a one-hot encoded array of transactions. The following snippet does that, saving the encoded array of transactions in a pandas DataFrame:

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    encoder = TransactionEncoder()
    encoded_array = encoder.fit(transactions).transform(transactions)
    df_itemsets = pd.DataFrame(encoded_array, columns=encoder.columns_)
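If you prefer to stay within pandas, an equivalent Boolean transaction-by-item matrix can also be built with explode() and crosstab(). The following is a sketch on a tiny invented basket list (not the article's dataset), so it is self-contained; note that the resulting column order may differ from TransactionEncoder's:

```python
import pandas as pd

# Toy baskets; the item names are invented for illustration.
baskets = [['bread', 'butter'], ['bread', 'cheese'], ['butter']]

# Explode each basket into (row index, item) pairs, then cross-tabulate
# into a Boolean transaction-by-item matrix.
pairs = pd.Series(baskets).explode()
onehot = pd.crosstab(pairs.index, pairs).astype(bool)
```

Either encoding produces one row per transaction and one Boolean column per distinct product, which is the shape the apriori() function expects.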

    Next, you can extract frequent itemsets from the encoded array. This can be done using the apriori() function from the mlxtend.frequent_patterns package, as follows:

    from mlxtend.frequent_patterns import apriori
    frequent_itemsets = apriori(df_itemsets, min_support=0.1, use_colnames=True)

    In this particular example, the min_support parameter is set to 0.1 to return the items and itemsets with at least 10% support. As a result, the content of the frequent_itemsets DataFrame should look as follows:

        support                  itemsets
    0      0.10                   (beets)
    1      0.25                 (cabbage)
    2      0.40                  (carrot)
    3      0.10                    (corn)
    4      0.30                (cucumber)
    5      0.10                (eggplant)
    6      0.20                  (pepper)
    7      0.25                  (potato)
    8      0.10                (radishes)
    9      0.10                 (spinach)
    10     0.15                  (tomato)
    11     0.20       (cucumber, cabbage)
    12     0.20          (pepper, carrot)
    13     0.15          (carrot, potato)
    14     0.10          (carrot, tomato)
    15     0.10            (corn, tomato)
    16     0.10          (pepper, potato)
    17     0.10        (potato, radishes)
    18     0.10  (pepper, potato, carrot)

    Now that you have the items and itemsets that meet the specified support threshold, you can proceed to the next step of the Apriori algorithm and generate association rules for the itemsets. The Mlxtend library provides the association_rules() function for doing that.

    from mlxtend.frequent_patterns import association_rules
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

    If you look at the generated rules, however, you may notice that they include more metrics than support, confidence, and lift. To minimize confusion, if you want to view only those three, you can explicitly specify the following:

    print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

    The command above should output the following rule set:

             antecedents       consequents  support  confidence      lift
    0         (cucumber)         (cabbage)     0.20    0.666667  2.666667
    1          (cabbage)        (cucumber)     0.20    0.800000  2.666667
    2           (pepper)          (carrot)     0.20    1.000000  2.500000
    3           (carrot)          (pepper)     0.20    0.500000  2.500000
    4           (potato)          (carrot)     0.15    0.600000  1.500000
    5           (tomato)          (carrot)     0.10    0.666667  1.666667
    6             (corn)          (tomato)     0.10    1.000000  6.666667
    7           (tomato)            (corn)     0.10    0.666667  6.666667
    8           (pepper)          (potato)     0.10    0.500000  2.000000
    9         (radishes)          (potato)     0.10    1.000000  4.000000
    10  (pepper, potato)          (carrot)     0.10    1.000000  2.500000
    11  (pepper, carrot)          (potato)     0.10    0.500000  2.000000
    12  (carrot, potato)          (pepper)     0.10    0.666667  3.333333
    13          (pepper)  (carrot, potato)     0.10    0.500000  3.333333

    As you can see, a separate rule is generated for each direction of a pair, such as cucumber -> cabbage and cabbage -> cucumber. The metrics generated for a rule depend on the order of the items in it: if you swap the antecedent and the consequent, the confidence may change, while the support and lift stay the same.

    Now that you have generated a set of association rules against your transaction data, how can you take advantage of it and turn it into actionable insights? The following section discusses how you can make recommendations for customers based on the purchase history summarized in the metrics that compose association rules.

    Using the data to make recommendations

    Making recommendations based on the items that customers have already added to their basket is a common technique used by retailers to show customers cross-sell items they might want to purchase. To generate such recommendations, you need to identify the items that customers buy together frequently. This is where association rules come in handy, providing metrics for identifying items frequently bought together.

    To start, you might want to filter the rules you have by a certain metric. For example, you might filter the collection of rules by applying a threshold to the lift metric. In the following code snippet, the lift threshold is set to 2.5. I suggest you also rename the antecedents and consequents columns to make it clear that the recommendation list for an antecedent is composed of its most common consequents.

    recommendations = (rules[rules['lift'] >= 2.5]
                       [['antecedents', 'consequents']]
                       .rename(columns={'antecedents': 'items',
                                        'consequents': 'recommendation'})
                       .reset_index(drop=True))

    So, the content of the recommendations DataFrame should look as follows:

                  items    recommendation
    0        (cucumber)         (cabbage)
    1         (cabbage)        (cucumber)
    2          (pepper)          (carrot)
    3          (carrot)          (pepper)
    4            (corn)          (tomato)
    5          (tomato)            (corn)
    6        (radishes)          (potato)
    7  (potato, pepper)          (carrot)
    8  (potato, carrot)          (pepper)
    9          (pepper)  (potato, carrot)

    Looking through the rows above, you may notice that there are rows with duplicate values in the items column. For example, you have two rows for pepper, which is not appropriate for a recommendation system. Here is how to combine such duplicates into a single row.

    recommendations['recommendation'] = recommendations['recommendation'].apply(list)
    recommendations = (recommendations
                       .groupby('items', sort=False)['recommendation']
                       .agg(lambda lists: sorted(set(sum(lists, []))))
                       .reset_index())

    After that, the rows in the recommendations DataFrame should look as follows:

                  items    recommendation
    0        (cucumber)         [cabbage]
    1         (cabbage)        [cucumber]
    2          (pepper)  [carrot, potato]
    3          (carrot)          [pepper]
    4            (corn)          [tomato]
    5          (tomato)            [corn]
    6        (radishes)          [potato]
    7  (potato, pepper)          [carrot]
    8  (potato, carrot)          [pepper]
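One way to serve such recommendations is to turn the DataFrame into a dictionary keyed by antecedent and suggest the consequents for any antecedent fully contained in a customer's basket. This is a hypothetical sketch: the two hard-coded rows below stand in for the full recommendations DataFrame, and the suggest() helper is not part of any library.

```python
import pandas as pd

# Two hand-copied rows standing in for the full recommendations DataFrame.
recommendations = pd.DataFrame({
    'items': [frozenset({'cucumber'}), frozenset({'pepper'})],
    'recommendation': [['cabbage'], ['carrot', 'potato']],
})
lookup = dict(zip(recommendations['items'], recommendations['recommendation']))

def suggest(basket):
    # Collect recommendations for every antecedent contained in the basket,
    # skipping items the customer already has.
    basket = frozenset(basket)
    recs = {r for antecedent, items in lookup.items() if antecedent <= basket
            for r in items if r not in basket}
    return sorted(recs)

print(suggest(['pepper', 'onion']))  # -> ['carrot', 'potato']
```

Because the antecedents are frozensets, the subset test also matches multi-item antecedents such as (potato, pepper) once the customer has both items in the basket.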

    Performing a similar analysis with Oracle Database

    Oracle Database allows you to perform association analysis with the Apriori algorithm. Use the DBMS_DATA_MINING package, a component of the Oracle Advanced Analytics option to Oracle Database Enterprise Edition. (In Oracle Database 21c, Oracle Data Mining was rebranded as Oracle Machine Learning for SQL. The PL/SQL package name, however, has not changed and remains DBMS_DATA_MINING.)

    Here is a simple example of using DBMS_DATA_MINING to generate association rules against a set of transaction data. To begin, you need to have a table or view with transaction data. In its simplest form, such a table might look like the one created below.

    CREATE TABLE transactions (
      trans_id NUMBER(10),
      prod_name VARCHAR2(20)
    );

    The trans_id column is the case identifier column in the table. In this context, a case is the collection of items found in a single transaction.

    Once you have created the transactions table, you need to populate it with data. For this, you can execute the insert statements found in the transactions.sql file accompanying this article. You might also want to prepare a market_settings table to override some default settings of the association model that you’re going to create next.


    Then, you can change the default values of the settings found in the market_settings table and build the association model, all in a single PL/SQL block.

     BEGIN
       UPDATE market_settings
          SET setting_value = TO_CHAR(0.1)
        WHERE setting_name = DBMS_DATA_MINING.asso_min_support;

       DBMS_DATA_MINING.CREATE_MODEL(
         model_name           => 'market_model',
         mining_function      => DBMS_DATA_MINING.ASSOCIATION,
         data_table_name      => 'transactions',
         case_id_column_name  => 'trans_id',
         target_column_name   => NULL,
         settings_table_name  => 'market_settings');
     END;
     /

    View the association rules generated for the transaction data found in the transactions table with the help of the DBMS_DATA_MINING.GET_ASSOCIATION_RULES table function, as illustrated in the following example:

    SELECT A.attribute_str_value antecedent,
           C.attribute_str_value consequent,
           rule_support support,
           rule_confidence confidence,
           rule_lift lift
      FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('market_model')) R,
           TABLE(R.antecedent) A,
           TABLE(R.consequent) C
     WHERE rule_lift >= 2.5 AND rule_confidence >= 0.5;

    The query above should produce the following output:

    ANTECEDENT CONSEQUENT SUPPORT     CONFIDENCE           LIFT
    ---------- ---------- ------- -------------- --------------
    pepper     carrot          .2              1            2.5
    radishes   potato          .1              1              4
    corn       tomato          .1              1     6.66666667
    potato     carrot          .1              1            2.5
    pepper     carrot          .1              1            2.5
    cabbage    cucumber        .2             .8     2.66666667
    cucumber   cabbage         .2     .666666667     2.66666667
    potato     pepper          .1     .666666667     3.33333333
    carrot     pepper          .1     .666666667     3.33333333
    tomato     corn            .1     .666666667     6.66666667
    carrot     pepper          .2             .5            2.5


    After an item has appeared in a customer's basket, which item is going to appear there next? You cannot say for sure, of course, but you can make a prediction based on the history of customer purchases, using association analysis. In this article, you saw how this type of analysis can be performed on a sample set of transactions using both Python and Oracle Database.

    Illustration: Wes Rowell

Yuli Vasiliev (@VasilievYuli) is a programmer, freelance writer, and consultant specializing in open source development, Oracle Database technologies, and natural-language processing (NLP). He is the author of Natural Language Processing with Python and spaCy.