Wednesday, November 26, 2008

Using Perl for Mass In Place Editing

Have you ever wanted to update some text in a bunch of files all at once without a hassle?  I had reason to do this recently and turned to Perl for my solution.  Once I had worked around the implementation quirks, it turned out to be quite easy.  This functionality is pretty well documented in general, but not for Windows, and there are some gotchas in trying to do it there.  What follows is what I found to work.


I had a set of files that were database queries for a particular milestone in our product release.  Let's call it M1.  I needed to update them all for a different milestone that we'll call M2.  Each file was an XML version of a query, so text processing seemed like the logical course of action.  Perl is good at text processing, so I went that route.  First off, I had to install Perl.


Perl can accept a single line of code at the command prompt, so there is no need to actually write a Perl script.  The switch for this is -e (execute?).  To change a file in place, add the -i (in place) switch.  The command looks something like this:



perl -pi.bak -e "s/M1/M2/g" file.txt


The -i.bak means to edit each file in place and keep the original under the same name with a .bak extension appended (file.txt.bak).  In theory -i alone will skip the backup and simply overwrite the originals, but ActivePerl wouldn't accept this.


The -p tells perl to run the code once for every line of input (i.e. over every line of the file) and to print each line afterwards, which is how the changed text ends up back in the file.
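
To make that concrete, here is a rough sketch of the loop that -p wraps around the -e code.  The sample text and the variable names are made up for illustration, and the input is held in memory so the sketch runs without touching any files.  The real -p reads from the files named on the command line (or standard input) and prints each line instead of collecting it in a string.

```perl
use strict;
use warnings;

# Made-up sample input, held in memory so no files are needed.
my $input = "Milestone M1\nTarget: M1, then M1 again\n";
open my $in, '<', \$input or die "cannot open in-memory input: $!";

my $output = '';
while (<$in>) {       # -p: read one line at a time into $_
    s/M1/M2/g;        # the code that was passed to -e
    $output .= $_;    # -p: the implicit print of $_ (collected here instead)
}
close $in;

print $output;
```

Without -p (or -n) there is no loop at all: the -e code runs exactly once with nothing in $_, and the file is never read.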


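The "s/M1/M2/g" code is a substitution that uses a regular expression: it replaces M1 with M2 globally, that is, every occurrence on each line rather than just the first.  The pattern could be any regular expression.  Note that most examples of this online wrap the code in single quotes ( ' ), but this doesn't work on Windows because cmd.exe does not treat single quotes as quoting characters; use double quotes instead.  One hint:  If the command fails, try adding -w to the command line to generate warnings.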


The above command will change all instances of M1 to M2 in file.txt.  What I wanted to do was to replace it in every file.  Simple, I'll just change file.txt to *.*.  Sorry, no dice.  ActivePerl doesn't accept this, nor does it accept *, because the wildcard reaches perl unexpanded.  Time for some more command-line action.  There is a for command at the cmd prompt which fits the bill.  Use it like so:



for %i in (*.txt) do perl -pi.bak -e "s/M1/M2/g" "%i"


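This command will iterate over all the matching files (*.txt) and execute the command following the do once per file, with %i set to the file name.  You have to quote the trailing %i because filenames might contain spaces.  If you put this line in a batch file rather than typing it at the prompt, write %%i instead of %i.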
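
If you would rather skip the for loop, a small script can expand the wildcard itself.  Treat this as a sketch rather than a finished tool: the script name replace.pl and the helper replace_in_files are made up for illustration, and it relies on the File::Glob module that ships with Perl.  Setting $^I inside a script is the equivalent of -i.bak on the command line.

```perl
use strict;
use warnings;
use File::Glob ();

# Hypothetical helper: replace every match of $from with $to in the files
# matching @patterns, keeping a .bak copy of each original.
sub replace_in_files {
    my ($from, $to, @patterns) = @_;

    # cmd.exe hands "*.txt" to perl untouched, so expand it here.
    # bsd_glob does not split a pattern on spaces the way plain glob does.
    local @ARGV = map { File::Glob::bsd_glob($_) } @patterns;
    return 0 unless @ARGV;    # nothing matched; don't fall back to STDIN

    my $files = @ARGV;        # number of files about to be edited
    local $^I = '.bak';       # same effect as -i.bak
    local $_;
    while (<>) {              # same loop that -p provides
        s/$from/$to/g;        # $from is treated as a regular expression
        print;                # goes back into the file being edited
    }
    return $files;
}

# Usage: perl replace.pl M1 M2 *.txt
replace_in_files(@ARGV) if @ARGV >= 3;
```

Run it as perl replace.pl M1 M2 *.txt.  A pattern that contains spaces still needs double quotes around it at the cmd prompt.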


There you go, quick and dirty text replacement from the command line.  Note that Perl regular expressions are capable of much more than simple search and replace.  You can use this technique to accomplish anything they can.


Is there an even simpler way to do this?  There probably is.  If so, please let me know in the comments.


 


12/4/08 - Updated with -p thanks to Maurits.

6 comments:

  1. `rpl' for *nix was a saviour for doing this across entire subdirectory structures without having to hack up any perl or find/grep/perl craziness.  And it does multi-line replacements, which is something I had some pain trying to get working in perl before I found it  :)
    http://www.laffeycomputer.com/rpl.html
    Looks like there have been ports to windows too:
    http://sourceforge.net/project/shownotes.php?release_id=614424

  2. Hmmm... looks like your command lines are missing a -p.
    FWIW, this /does/ work in UNIX shells:
    perl -pi.bak -e "s/M1/M2/g" *.txt
    ... but only because the shell replaces the *.txt with the expanded list of files prior to invoking the perl command.
    This works in Windows too, but doesn't scale to large numbers of files:
    perl -pi.bak -e "s/M1/M2/g" query1.txt query2.txt query3.txt

  3. Maurits, the -p says to iterate over the -e command for each item on the command line.  Because I'm only sending one item at a time, it's unnecessary.

  4. Without the -p I get this, and the file is unchanged:
    >perl -w -i.bak -e "s/banana/grapefruit/g" one.txt
    Use of uninitialized value in substitution (s///) at -e line 1.
    With the -p I get no output and the file is changed.
    -i takes care of iterating over each file on the command line and feeding STDIN and STDOUT to the right places.
    What -p does is:
    1) execute the program (the -e in this case) for each line of <STDIN>. (-n also does this.)
    2) after the program has run, print the final value of $_ to STDOUT. (-n doesn't do this though.)

  5. Actually, -i takes care of iterating over each file and redirecting STDIN and STDOUT as appropriate.
    -p takes care of iterating over each line of STDIN, assigning that line to $_, running the program (or -e), and adding an extra "print;" at the end of each iteration.

  6. The non-expansion of wildcards reflects the Unix roots of perl as in that world wildcard expansion is a responsibility of the shell instead of each individual program.
    However you might want to check out http://www.perl.com/doc/manual/html/Porting/README.win32.html and scroll down to the "Command-line Wildcard Expansion" section for a pretty easy and transparent solution. This should let you get rid of your for loop.
    If you do a lot of editing like this it might be worth your while to actually install a version of sed. You are already using vim and now perl so why not go the whole hog and get the rest of the Unix command line tools like sed, grep, awk etc? The Microsoft alternative would probably be to do the above operation with powershell:-
    gc foo.txt | % { $_ -replace <pattern>, <replacement> }
    Andrew.
