What is difflib in Python?

This module in the Python standard library provides classes and functions for comparing sequences such as strings and lists. In this article we will look at the basics of SequenceMatcher, get_close_matches and Differ.

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980’s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that look right to people.
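As a rough illustration of the matching blocks described above, the following sketch asks SequenceMatcher for its longest match and for the full list of matching blocks. The two input strings are assumptions picked for this demonstration only.

```python
from difflib import SequenceMatcher

# Illustrative inputs (assumed for this sketch).
a = "abxcd"
b = "abcd"

matcher = SequenceMatcher(None, a, b)

# Longest contiguous block shared by a and b.
longest = matcher.find_longest_match(0, len(a), 0, len(b))
print(a[longest.a:longest.a + longest.size])

# All matching blocks, found by repeating the search to the left and
# right of the longest block. The last entry is a zero-length sentinel.
blocks = matcher.get_matching_blocks()
for block in blocks:
    print(block)
```

Each block is a named tuple (a, b, size), meaning a[a:a+size] == b[b:b+size].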

So, let’s see how to use it.

from difflib import SequenceMatcher
str1 = 'abcd'
str2 = 'abcde'
seq = SequenceMatcher(a=str1, b=str2)
print(seq.ratio())

The SequenceMatcher class accepts two parameters, a and b, compares the two sequences, and gives us a similarity ratio between 0 and 1. The code above outputs 0.8888888888888888, which means str2 is about 89% similar to str1.
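To see where that number comes from, here is a small sketch that recomputes the ratio by hand. It relies on the documented definition of ratio() as 2.0*M/T, where M is the number of matching elements and T is the total number of elements in both sequences. The variable names matches, total and manual_ratio are introduced here for illustration.

```python
from difflib import SequenceMatcher

str1 = 'abcd'
str2 = 'abcde'
seq = SequenceMatcher(a=str1, b=str2)

# M: number of matching elements, summed over all matching blocks.
matches = sum(block.size for block in seq.get_matching_blocks())
# T: total number of elements in both sequences.
total = len(str1) + len(str2)

manual_ratio = 2.0 * matches / total
print(matches, total)
print(manual_ratio, seq.ratio())
```

Here the four characters of 'abcd' match, so the ratio is 2*4/9.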

The get_close_matches function returns the words from a list that are most similar to a given string.

from difflib import get_close_matches
word_list = ['acdefgh', 'abcd','adef','cdea']
str1 = 'abcd'
matches = get_close_matches(str1, word_list, n=2, cutoff=0.3)
print(matches)

Here n is the maximum number of similar words we want in the output and cutoff is the minimum ratio a word needs in order to count as similar. This snippet outputs ['abcd', 'acdefgh']. If we increase the cutoff to 0.7, it outputs only ['abcd'], as that is the only word in the list with a similarity ratio of at least 0.7. This function comes in very handy for quick typo detection: for example, if we type ‘appl’, it can suggest ‘apple’.
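A minimal sketch of such a typo hint is shown below. The helper name suggest and the small vocabulary list are hypothetical, made up for this example; only get_close_matches comes from difflib.

```python
from difflib import get_close_matches

def suggest(word, vocabulary, cutoff=0.6):
    """Return the closest vocabulary entry, or None if nothing is close enough."""
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical vocabulary for the demonstration.
vocabulary = ['apple', 'banana', 'cherry']

print(suggest('appl', vocabulary))
print(suggest('xyz', vocabulary))
```

Returning None when nothing clears the cutoff lets the caller decide whether to show a "did you mean" message at all.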

The Differ class provides a human-readable view of the differences (deltas) between two sequences of lines.

from difflib import Differ
from pprint import pprint

txt1 = '''
hello world.
we like python.'''.splitlines()

txt2 = '''
hello world.
we like python coding'''.splitlines()

dif = Differ()
df = list(dif.compare(txt1, txt2))
pprint(df)

This gives us an output like this.

[Figure: output of Differ.compare()]

Here we can see that it compares txt2 with txt1 and gives us a human-readable structure showing what changed in txt2 relative to txt1.

As we can see, ‘hello world.’ is the same in both sequences, but the second sentence has changed, and the output shows that ‘coding’ is the change in the second sentence.
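For readers without the screenshot, the sketch below prints a Differ delta and splits it by the two-character code that starts each line. The variable names before, after, removed, added and unchanged are chosen for this example, and the inputs omit the leading empty line of the triple-quoted strings for brevity.

```python
from difflib import Differ

before = ['hello world.', 'we like python.']
after = ['hello world.', 'we like python coding']

delta = list(Differ().compare(before, after))
for line in delta:
    print(line.rstrip('\n'))

# Each delta line starts with a two-character code:
#   '  ' line common to both sequences
#   '- ' line only in the first sequence
#   '+ ' line only in the second sequence
#   '? ' hint line marking where the change sits inside a line
removed = [line[2:] for line in delta if line.startswith('- ')]
added = [line[2:] for line in delta if line.startswith('+ ')]
unchanged = [line[2:] for line in delta if line.startswith('  ')]
```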

There are many more useful and more complex functions in the difflib module, so do check out the official Python documentation for it.

So let's get started with this amazing Python module, Difflib.


Introduction:

Let's say you have a use case where you need to find similar keywords for every keyword in a column. How can we do that? One option is to compute an embedding for every keyword, calculate the cosine similarity between each pair of keywords, and map each keyword to the one with the highest cosine similarity score. But calculating embeddings and then cosine similarities is a computationally heavy task, and if the list is large it will take a lot of time as well. This is where Difflib comes to the rescue. Difflib is a module that provides functions for comparing sequences. It can be used to compare strings and get additional information about how they differ.
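A minimal sketch of that use case follows, assuming the column has already been read into a plain Python list. The keyword values and the helper closest_keyword are hypothetical and exist only for illustration.

```python
from difflib import get_close_matches

# Hypothetical keyword column (assumed data for this sketch).
keywords = ['python', 'pyhton', 'java', 'javascript', 'jave']

def closest_keyword(keyword, candidates, cutoff=0.6):
    """Return the most similar other keyword, or None if none reaches the cutoff."""
    others = [c for c in candidates if c != keyword]
    matches = get_close_matches(keyword, others, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Map every keyword to its closest neighbour in the same column.
mapping = {kw: closest_keyword(kw, keywords) for kw in keywords}
for kw, match in mapping.items():
    print(kw, '->', match)
```

This compares every keyword against every other one, so the work still grows quadratically with the length of the column; it only avoids the embedding step.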

Functions:

1. difflib.SequenceMatcher : SequenceMatcher is a class in the Difflib module. It is used to compare two strings and get a similarity score between them. It finds the longest contiguous matching subsequence between the two strings, optionally ignoring "junk" elements such as spaces or blank lines.

Example :

import difflib
a = 'Medium'
b = 'Median'
seq = difflib.SequenceMatcher(None,a,b)
d = seq.ratio()*100
print(d)
66.66666666666666

We can see from the above block of code that when we compare ‘Medium’ and ‘Median’, we get about 66.7% similarity.

a = 'Medium'
b = 'Mediun'
seq = difflib.SequenceMatcher(None,a,b)
d = seq.ratio()*100
print(d)
83.33333333333334

When we change string b to ‘Mediun’, the similarity ratio goes up to 83.3%. This is because two characters differ in the first example, whereas only one character differs in the second.
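One way to see exactly which characters differ is get_opcodes(), which describes how to turn a into b. The helper opcodes below is a hypothetical wrapper written for this sketch.

```python
import difflib

def opcodes(a, b):
    """Return the list of (tag, i1, i2, j1, j2) steps that turn a into b."""
    return difflib.SequenceMatcher(None, a, b).get_opcodes()

for a, b in [('Medium', 'Median'), ('Medium', 'Mediun')]:
    print(a, '->', b)
    for tag, i1, i2, j1, j2 in opcodes(a, b):
        # tag is one of 'equal', 'replace', 'delete' or 'insert'.
        print(f'  {tag:8} a[{i1}:{i2}]={a[i1:i2]!r}  b[{j1}:{j2}]={b[j1:j2]!r}')
```

The first pair shares ‘Medi’ and replaces two characters; the second shares ‘Mediu’ and replaces only one.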

The SequenceMatcher constructor takes four optional parameters: isjunk, a, b, and autojunk.
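The effect of isjunk is easiest to see on a small case. The sketch below, adapted from the example in the official difflib documentation, compares the longest match with and without treating spaces as junk; the variable names are chosen here for illustration.

```python
from difflib import SequenceMatcher

a = " abcd"
b = "abcd abcd"

# isjunk=None: no element is treated as junk.
no_junk = SequenceMatcher(None, a, b)
match_plain = no_junk.find_longest_match(0, len(a), 0, len(b))
print(match_plain)

# isjunk as a function: spaces are junk, so a match cannot start on one.
space_is_junk = SequenceMatcher(lambda ch: ch == " ", a, b)
match_junk = space_is_junk.find_longest_match(0, len(a), 0, len(b))
print(match_junk)
```

Without junk handling, the longest match is ' abcd' including the leading space; with spaces as junk, it is the plain 'abcd' at the start of b.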

2. difflib.get_close_matches : get_close_matches is a function that returns a list of the best matches for a given keyword. When we pass an input string and a list of candidate strings to get_close_matches, it returns the candidates that are close enough to the input string, most similar first.

Its optional parameters are n and cutoff: n is the maximum number of close matches to return (default 3), and cutoff is a float between 0 and 1 (default 0.6). Candidates that score below the cutoff are ignored.

Example:

from difflib import get_close_matches
get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

Here for the input word ‘appel’ we get ‘apple’ and ‘ape’ as the most similar words. Let's check another example:

import difflib
difflib.get_close_matches('when', ['what', 'whene','where','why'], n=2, cutoff=0.8)
['whene']

In this example we have added the parameters n and cutoff. Because the cutoff is 0.8, we do not get ‘where’ as a close match, but if we lower the cutoff we will get ‘where’ as well.
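To check that claim, here is a short sketch that runs the same query with two cutoffs. The value 0.6 is an assumption chosen for this demonstration (it also happens to be the function's default); the names words, strict and relaxed are illustrative.

```python
import difflib

words = ['what', 'whene', 'where', 'why']

# Same query as above, with the original cutoff and a lower one.
strict = difflib.get_close_matches('when', words, n=2, cutoff=0.8)
relaxed = difflib.get_close_matches('when', words, n=2, cutoff=0.6)

print(strict)
print(relaxed)
```

‘where’ scores 2*3/9, roughly 0.67 against ‘when’, so it clears 0.6 but not 0.8.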

There are many other classes and functions in difflib, such as difflib.Differ, difflib.HtmlDiff, difflib.context_diff, difflib.ndiff, difflib.restore and difflib.unified_diff, which can be used as the use case requires.
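As one example from that list, the sketch below uses difflib.unified_diff to produce a patch-style diff of two small lists of lines. The file names before.txt and after.txt are placeholder labels invented for this example; no files are read.

```python
import difflib

before = ['hello world.', 'we like python.']
after = ['hello world.', 'we like python coding']

# lineterm='' because the input lines carry no trailing newlines.
diff = list(difflib.unified_diff(
    before, after,
    fromfile='before.txt', tofile='after.txt',
    lineterm='',
))
print('\n'.join(diff))
```

Lines prefixed with '-' come only from the first sequence, lines prefixed with '+' only from the second, and lines prefixed with a space are shared context.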

For more detailed information on any of these functions, do check out the official documentation of the Difflib module: https://docs.python.org/3/library/difflib.html

Thank You!

Is Difflib built in?

Difflib is something of a hidden gem among Python's built-in libraries. Because it is built into Python 3, we don't need to download or install anything; we can simply import it.

How does Difflib SequenceMatcher work?

SequenceMatcher is a class in the difflib module of the Python standard library. The difflib module provides classes and functions for comparing sequences. It can be used to compare files and can produce information about file differences in various formats. SequenceMatcher itself compares two input sequences, such as two strings.

What is SequenceMatcher in Python?

SequenceMatcher is a class available in the Python module named “difflib”. It can be used for comparing pairs of input sequences.
