Sed to replace variable length string between 2 known patterns


I'd like to be able to replace a string between 2 known patterns. The catch is that I want to replace it with a string of the same length composed of 'x'.

Let's say I have a file containing:

hello.stringtobereplaced.secondstring
hello.shortstring.secondstring

I'd like the output to look like this:

hello.xxxxxxxxxxxxxxxxxx.secondstring
hello.xxxxxxxxxxx.secondstring

Using sed loops

You can use sed, though the thinking required is not wholly obvious:

sed ':a;s/^\(hello\.x*\)[^x]\(.*\.secondstring\)/\1x\2/;t a' 

This is for GNU sed; BSD (Mac OS X) sed and other versions may be fussier and require:

sed -e ':a' -e 's/^\(hello\.x*\)[^x]\(.*\.secondstring\)/\1x\2/' -e 't a' 

The logic is identical in both:

  • Create a label a.
  • Substitute the lead string and a sequence of x's (capture 1), followed by a non-x, and arbitrary other data plus the second string (capture 2), and replace it with the contents of capture 1, an x, and the content of capture 2.
  • If the s/// command made a change, go back to label a.

It stops substituting when there are no non-x's left between the 2 marker strings.
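To see what each pass of the loop does, it can help to run the substitution once without the label and branch, and then again with them. This is only a sketch, assuming a POSIX shell; the function names mask_once and mask_loop and the short sample input are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: apply the substitution a single time (no loop).
mask_once() {
    sed 's/^\(hello\.x*\)[^x]\(.*\.secondstring\)/\1x\2/'
}

# Hypothetical helper: repeat the substitution until it stops changing the line.
# The multi -e form is used so that it also works with fussier seds.
mask_loop() {
    sed -e ':a' -e 's/^\(hello\.x*\)[^x]\(.*\.secondstring\)/\1x\2/' -e 't a'
}

echo 'hello.abc.secondstring' | mask_once   # hello.xbc.secondstring
echo 'hello.abc.secondstring' | mask_loop   # hello.xxx.secondstring
```

Each pass replaces exactly one character, so a middle section of n characters takes n successful substitutions plus one failed attempt that ends the loop.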

Two tweaks to the regex allow the code to recognize 2 copies of the pattern on a single line. Lose the ^ that anchors the match to the beginning of the line, and change the .* to [^.]* (so the regex is not quite so greedy):

$ echo hello.stringtobereplaced.secondstring hello.stringtobereplaced.secondstring |
> sed ':a;s/\(hello\.x*\)[^x]\([^.]*\.secondstring\)/\1x\2/;t a'
hello.xxxxxxxxxxxxxxxxxx.secondstring hello.xxxxxxxxxxxxxxxxxx.secondstring
$

Using the hold space

hek2mgl suggests an alternative approach in sed using the hold space. This can be implemented using:

$ echo hello.stringtobereplaced.secondstring |
> sed 's/^\(hello\.\)\([^.]\{1,\}\)\(\.secondstring\)/\1@\3@@\2/
>      h
>      s/.*@@//
>      s/./x/g
>      G
>      s/\(x*\)\n\([^@]*\)@\([^@]*\)@@.*/\2\1\3/
>      '
hello.xxxxxxxxxxxxxxxxxx.secondstring
$

This script is not as robust as the looping version, but it works OK as written when each line matches the lead-middle-tail pattern. It first splits the line into 3 sections: the first marker, the bit to be mangled, and the second marker. It reorganizes this so that the 2 markers are separated by @, followed by @@ and the bit to be mangled. The h copies the result to the hold space. Remove everything up to and including the @@; replace each character in the bit to be mangled with x, then append the material in the hold space after the x's in the pattern space, with a newline separating them. Finally, recognize and capture the x's, the lead marker, and the tail marker, ignoring the newline, the @ and the @@ plus the trailing material, and reassemble them as the lead marker, the x's, and the tail marker.
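The first two stages of that description can be run on their own to inspect the intermediate pattern space. This is a rough sketch, assuming a POSIX shell; the function names stage1 and stage2 and the short sample input are illustrative only.

```shell
#!/bin/sh
# Stage 1 (hypothetical name): rearrange the line into lead@tail@@middle.
stage1() {
    sed 's/^\(hello\.\)\([^.]\{1,\}\)\(\.secondstring\)/\1@\3@@\2/'
}

# Stage 2 (hypothetical name): drop everything up to and including @@,
# then turn each remaining character into an x.
stage2() {
    sed -e 's/.*@@//' -e 's/./x/g'
}

echo 'hello.abc.secondstring' | stage1            # hello.@.secondstring@@abc
echo 'hello.abc.secondstring' | stage1 | stage2   # xxx
```

In the full script both results are needed at once, which is why the stage 1 result is saved in the hold space before stage 2 overwrites the pattern space.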

To make this robust, you'd recognize the pattern and group the commands shown inside { and } so that they're only executed when the pattern is recognized:

sed '/^\(hello\.\)\([^.]\{1,\}\)\(\.secondstring\)/{
     s/^\(hello\.\)\([^.]\{1,\}\)\(\.secondstring\)/\1@\3@@\2/
     h
     s/.*@@//
     s/./x/g
     G
     s/\(x*\)\n\([^@]*\)@\([^@]*\)@@.*/\2\1\3/
     }'

Adjust to suit your needs...

Adjusting to suit your needs

[I tried one of your solutions and it worked fine.] When I try to replace 'hello' with my real string (which is '1.2.840.') and the second string (which is a dot '.'), things stop working. I guess these dots confuse the sed command. What I'm trying to achieve is to transform '1.2.840.10008.' to '1.2.840.xxxxx.'

And this pattern happens several times in my file, with a variable number of characters to be replaced between '1.2.840.' and the next dot '.'

There are times when it is important to get the question close enough to the real scenario, and this may be one such. Dot is a metacharacter in sed regular expressions (and in most other dialects of regular expression; shell globbing is the noticeable exception). If the 'bit to be mangled' is always digits, you can tighten up the regular expressions, though (when I look at the code ahead) the tightening isn't really imposing much in the way of restriction.
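The effect of the unescaped dots is easy to demonstrate. In this small sketch (assuming a POSIX shell; the function names loose and strict and the sample inputs are invented for illustration), the unescaped pattern matches text that contains no dots at all:

```shell
#!/bin/sh
# Unescaped dots match any single character.
loose() {
    sed 's/1.2.840./[match]/'
}

# Escaped dots match only a literal "." character.
strict() {
    sed 's/1\.2\.840\./[match]/'
}

echo '1x2y840z' | loose    # [match]
echo '1x2y840z' | strict   # 1x2y840z
echo '1.2.840.' | strict   # [match]
```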

Pretty much any solution using regular expressions is a balancing act that has to pit convenience and abbreviation against reliability and precision.

Revised code plus data

cat <<eof |
transform '1.2.840.10008.' '1.2.840.xxxxx.'
ok, , hence 1.2.840.21. , 1.2.840.20992. should lose 21 , 20992.
eof
sed ':a;s/\(1\.2\.840\.x*\)[^x.]\([^.]*\.\)/\1x\2/;t a'

Example output:

transform '1.2.840.xxxxx.' '1.2.840.xxxxx.'
ok, , hence 1.2.840.xx. , 1.2.840.xxxxx. should lose 21 , 20992.

The changes in the script are:

sed ':a;s/\(1\.2\.840\.x*\)[^x.]\([^.]*\.\)/\1x\2/;t a' 
  1. Add 1\.2\.840\. as the start pattern.
  2. Revise the 'character to replace' expression to 'not x or .'.
  3. Use \. as the tail pattern.

You could replace the [^x.] with [0-9] if you're sure you want just digits matched, in which case you won't have to worry about the spaces discussed below.
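A digits-only variant might look like the following sketch. It assumes a POSIX shell; the function name mask_digits and the sample inputs are made up here, and the only change from the revised script is swapping [^x.] for [0-9].

```shell
#!/bin/sh
# Hypothetical helper: mask only digits that follow 1.2.840. (one per pass).
mask_digits() {
    sed -e ':a' -e 's/\(1\.2\.840\.x*\)[0-9]\([^.]*\.\)/\1x\2/' -e 't a'
}

echo "transform '1.2.840.10008.'" | mask_digits        # transform '1.2.840.xxxxx.'
echo 'The prefix is 1.2.840. And more.' | mask_digits   # line is left unchanged
```

Because a space is not a digit, the second line never matches, so ordinary prose that follows the prefix is left alone.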

You may decide you don't want spaces matched, so that a casual comment like:

The net prefix is 1.2.840. And there are other prefixes too.

does not end up as:

the net prefix 1.2.840.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx. 

In that case, you need to use:

sed ':a;s/\(1\.2\.840\.x*\)[^x. ]\([^ .]*\.\)/\1x\2/;t a' 

And so the changes continue until you've got it precise enough to do what you want without doing what you don't want on your current data set. Writing bullet-proof regular expressions requires a precise specification of what you want matched, and can be quite hard.

