How to use Python diff-match-patch


Google's diff-match-patch API is the same in every language it is implemented in (Java, JavaScript, Dart, C++, C#, Objective-C, Lua, and Python 2.x or 3.x). You can therefore usually rely on sample snippets written in another language to work out which API calls a given diff, match, or patch task needs.

For a simple "semantic" comparison, this is what you need:

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

# Create a diff_match_patch object
dmp = diff_match_patch.diff_match_patch()

# Depending on the overall length and complexity of the text you work
# with, you may want to extend the timeout or, as here, disable it
# (0 means no time limit)
dmp.Diff_Timeout = 0   # or some other value, default is 1.0 seconds

# All 'diff' jobs start with invoking diff_main()
diffs = dmp.diff_main(textA, textB)

# diff_cleanupSemantic() is used to make the diffs array more "human" readable
dmp.diff_cleanupSemantic(diffs)

# If you want the result as a ready-to-display HTML snippet
htmlSnippet = dmp.diff_prettyHtml(diffs)


A word on "semantic" processing by diff-match-patch
Beware that this processing is useful mainly for presenting differences to a human viewer: it tends to produce a shorter list of differences by avoiding irrelevant resynchronization of the texts (for example, when two distinct words happen to share letters in the middle). The results are far from perfect, however, because the processing relies on simple heuristics based on the length of differences, surface patterns, and so on, rather than on actual NLP processing based on lexicons and other semantic-level devices.
For example, the textA and textB values used above produce the following diffs arrays, before and after diff_cleanupSemantic():

[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]

Nice! The letter 'e' common to 'red' and 'blue' causes diff_main() to split this area of the text into four operations, but diff_cleanupSemantic() reduces them to just two edits, nicely singling out the differing words 'red' and 'blue'.
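To make the tuple format concrete, here is a minimal sketch in plain Python that rebuilds both texts from each diffs array and confirms that the cleanup changes only how the edits are grouped, not what they describe. The helper names source_text and target_text are made up for this illustration; the library itself offers diff_text1() and diff_text2() for the same purpose.

```python
# Operation codes as they appear in the diffs arrays above.
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def source_text(diffs):
    """Rebuild the original text: everything except insertions."""
    return "".join(text for op, text in diffs if op != DIFF_INSERT)


def target_text(diffs):
    """Rebuild the new text: everything except deletions."""
    return "".join(text for op, text in diffs if op != DIFF_DELETE)


before = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
          (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
after = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
         (-1, 'red'), (1, 'blue'), (0, ' hat')]

# Both arrays describe exactly the same pair of texts.
print(source_text(before) == source_text(after))
print(target_text(before) == target_text(after))
```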

However, if we have, for example:

textA = "stackoverflow is cool"
textB = "so is very cool"

The before/after arrays produced are:

[(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
[(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]

This shows that the allegedly semantically improved "after" can be rather unduly "tortured" compared to the "before". Note, for example, how the leading 's' is kept as a match and how the added word 'very' is mixed with part of the 'is cool' expression. Ideally, we would probably expect a word-level result: 'stackoverflow' replaced by 'so', then ' very' inserted between ' is' and ' cool'.
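One way to get a word-level result of that kind is to diff word tokens instead of characters. The sketch below is only an illustration and deliberately uses the standard library's difflib rather than diff-match-patch; the function name word_diff and the tokenizing regular expression are assumptions made for this example. It emits tuples in the same (operation, text) shape as the arrays above.

```python
import difflib
import re


def word_diff(text_a, text_b):
    """Word-level diff as (op, text) tuples: -1 delete, 0 equal, 1 insert."""
    # Each token is a word together with its leading whitespace; a trailing
    # run of whitespace becomes its own token, so no character is lost.
    tokens_a = re.findall(r"\s*\S+|\s+", text_a)
    tokens_b = re.findall(r"\s*\S+|\s+", text_b)

    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    diffs = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            diffs.append((0, "".join(tokens_a[a1:a2])))
            continue
        # 'replace' yields both a deletion and an insertion.
        if a2 > a1:
            diffs.append((-1, "".join(tokens_a[a1:a2])))
        if b2 > b1:
            diffs.append((1, "".join(tokens_b[b1:b2])))
    return diffs


print(word_diff("stackoverflow is cool", "so is very cool"))
```

Because whole words are the unit of comparison, a stray shared letter such as the leading 's' can no longer be reported as a match.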

Further analysis of the maintenance status of diff-match-patch-python, based on the cadence of released PyPI versions, repository activity, and other data points, determined that its maintenance is Inactive.

An important project maintenance signal to consider for diff-match-patch-python is that no new versions have been released to PyPI in the past 12 months. It could be considered a discontinued project, or one that receives little attention from its maintainers.

In the past month, no pull request activity or change in issue status was detected for the GitHub repository.

function stringSimilarity(text1, text2) {
  const dmp = new DiffMatchPatch()
  dmp.Diff_Timeout = 0.1
  const diff = dmp.diff_main(text1, text2)
  dmp.diff_cleanupSemantic(diff)
  const distance = dmp.diff_levenshtein(diff)
  const maxDistance = Math.max(text1.length, text2.length)
  const similarity = 1 - distance / maxDistance
  return similarity
}
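The same similarity score can be sketched in Python directly on a diffs array. The code below is a plain-Python approximation of what diff_levenshtein() computes, as I understand it: within each run of edits between two equalities, the cost is the larger of the inserted and deleted character counts. The function names are made up for this example, and the guard for two empty texts is an addition that the JavaScript snippet does not have.

```python
def levenshtein_from_diffs(diffs):
    """Edit distance implied by a list of (op, text) tuples."""
    distance = insertions = deletions = 0
    for op, text in diffs:
        if op == 1:
            insertions += len(text)
        elif op == -1:
            deletions += len(text)
        else:
            # An equality closes the current run of edits.
            distance += max(insertions, deletions)
            insertions = deletions = 0
    return distance + max(insertions, deletions)


def similarity_from_diffs(diffs):
    """1.0 for identical texts, approaching 0.0 as they diverge."""
    len_a = sum(len(text) for op, text in diffs if op != 1)
    len_b = sum(len(text) for op, text in diffs if op != -1)
    longest = max(len_a, len_b)
    if longest == 0:
        return 1.0  # two empty texts: treated as identical
    return 1 - levenshtein_from_diffs(diffs) / longest


cleaned = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
           (-1, 'red'), (1, 'blue'), (0, ' hat')]
print(levenshtein_from_diffs(cleaned))   # 10
print(similarity_from_diffs(cleaned))
```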

The Diff Match and Patch libraries offer robust algorithms to perform the operations required for synchronizing plain text.

  1. Diff:
    • Compare two blocks of plain text and efficiently return a list of differences.
  2. Match:
    • Given a search string, find its best fuzzy match in a block of plain text. Weighted for both accuracy and location.
  3. Patch:
    • Apply a list of patches onto plain text. Use best-effort to apply patch even when the underlying text doesn't match.

Originally built in 2006 to power Google Docs, this library is now available in C++, C#, Dart, Java, JavaScript, Lua, Objective C, and Python.

Reference

  • API - Common API across all languages.
  • Line or Word Diffs - Less detailed diffs.
  • Plain Text vs. Structured Content - How to deal with data like XML.
  • Unidiff - The patch serialization format.
  • Newsgroup for developers.

Languages

Although each language port of Diff Match Patch uses the same API, there are some language-specific notes.

  • C++
  • C#
  • Dart
  • Java
  • JavaScript
  • Lua
  • Objective-C
  • Python

A standardized speed test tracks the relative performance of diffs in each language.

Algorithms

This library implements Myers's diff algorithm, which is generally considered to be the best general-purpose diff. A layer of pre-diff speedups and post-diff cleanups surrounds the diff algorithm, improving both performance and output quality.
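For intuition, here is a minimal sketch of the greedy core of Myers's algorithm. It is a simplified illustration, not the library's implementation: it returns only the size of the shortest edit script (insertions plus deletions) and includes none of the speedups or cleanups mentioned above. The function name is made up for this example.

```python
def myers_edit_distance(a, b):
    """Smallest number of single-element insertions and deletions turning a into b."""
    n, m = len(a), len(b)
    # furthest[k] is the furthest x reached so far on diagonal k = x - y.
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Step down (insertion) or right (deletion), whichever starts
            # from the point that has advanced furthest.
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]
            else:
                x = furthest[k - 1] + 1
            y = x - k
            # Follow the diagonal for free while the elements match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # not reached; kept so the function always returns an int


print(myers_edit_distance("red", "blue"))  # 5: delete 'r' and 'd', insert 'b', 'l', 'u'
```

The result for "red" and "blue" agrees with the uncleaned diff shown earlier, where 'r' and 'd' are deleted and 'blu' is inserted around the shared 'e'.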

This library also implements a Bitap matching algorithm at the heart of a flexible matching and patching strategy.
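As a rough illustration of the Bitap idea, the sketch below implements only the exact-match ("shift-and") form in plain Python. The library's matcher is fuzzy and also weighs location, which this simplified version does not attempt; the function name bitap_search is made up for this example.

```python
def bitap_search(text, pattern):
    """Index of the first exact occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0

    # Bit i of masks[ch] is set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # Bit i of state is set when pattern[:i + 1] matches the text
    # ending at the current position.
    state = 0
    for pos, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & (1 << (m - 1)):
            return pos - m + 1
    return -1


print(bitap_search("the cat in the red hat", "red"))  # 15
```

Each text character costs a shift, an OR, and an AND on a single integer, which is why the technique stays fast for short patterns.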