Batch Editing XML and text files



Jeff the Green
2013-11-15, 12:18 AM
I know this isn't the best forum to ask this, but I honestly don't know what would be.

I have some XML files I need to create batches of. I have several template XML files, each of which needs to have ~1000 versions made of them with nine fields replaced with strings from one of ~1000 files. These files are .nex, a way of storing gene sequences. The syntax is this:


#NEXUS

BEGIN DATA;
DIMENSIONS NTAX=9 NCHAR=10000;
FORMAT DATATYPE = DNA GAP = - MISSING = ?;
MATRIX
Species1
AGTCAGTCAGTCAGTCAGTCAGTC

Species2
AGTCAGTCAGTCAGTCAGTCAGTC

Though obviously there are more than just Species1 and Species2, those species actually have names, and the sequences are less repetitive and unique to a species. I need to pull the sequences and put them into the appropriate XML fields.

Anyway, I know enough Ruby and Linux scripting that if anyone could give me instructions I could write a script for it. And I'm familiar enough with programming in general that I could edit something to match the particulars.

Like I said, I know this isn't the perfect forum for this, but I don't know where to ask, so if anyone could direct me to an appropriate place I'd appreciate that almost as much as a solution. :smallsmile:

Grinner
2013-11-15, 07:12 AM
I don't know Ruby, so I can't offer anything but general advice.

Assuming you can't just rebuild the XML files with the program, you'll need something to let you navigate and write data to the XML trees. I'm not sure if there are native functions for that, or if you'll need a library of some kind.

Once you've solved that problem, you'll need to extract the sequences from their respective files. If the files are just like you've described here, then you just need to read the whole line into a string.

There are two caveats to this approach. First, are the files set up exactly as you've described? Wikipedia gives an example more like this:


#NEXUS
Begin data;
Dimensions ntax=4 nchar=15;
Format datatype=dna symbols="ACTG" missing=? gap=-;
Matrix
Species1 atgctagctagctcg
Species2 atgcta??tag-tag
Species3 atgttagctag-tgg
Species4 atgttagctag-tag
;
End;

...and those changes will cause the program to start reading junk into the XML files. If so, you would need to adjust the program to strip the species name from the start of each line before using the sequence.
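
A minimal sketch of one way to cope with both layouts, in shell since the original poster mentioned Linux scripting. It assumes species names contain no whitespace and each sequence sits on a single line. The helper name `get_seq` and the two sample files are made up for illustration.

```shell
# Hypothetical helper: print the sequence for one species from a NEXUS file.
# Handles "Name SEQUENCE" on one line and "Name" followed by SEQUENCE on the next.
get_seq() {
    awk -v name="$1" '
        $1 == name && NF >= 2 { print $2; exit }   # name and sequence on one line
        $1 == name            { want = 1; next }   # name alone: wait for the next line
        want && NF            { print $1; exit }   # first non-blank line after the name
    ' "$2"
}

# Two tiny made-up sample files, one per layout.
tmp=$(mktemp -d)
printf 'MATRIX\nSpecies1\nAGTCAGTC\n\nSpecies2\nTTGACCAA\n' > "$tmp/a.nex"
printf 'Matrix\nSpecies1 atgctagc\nSpecies2 atgcta??\n;\nEnd;\n' > "$tmp/b.nex"

seq_a=$(get_seq Species2 "$tmp/a.nex")
seq_b=$(get_seq Species2 "$tmp/b.nex")
echo "$seq_a"
echo "$seq_b"
rm -rf "$tmp"
```

If real files wrap a sequence over several lines (interleaved NEXUS does this), this sketch would only return the first chunk and would need extending.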

Second, are all the sequences that short? DNA sequences are usually far lengthier. If you don't allocate enough memory to store them, it'll start giving errors.

The last problem you'll need to solve is knowing what needs to be written to where. As you've not mentioned any particulars in that regard, I'll assume you have some kind of a plan for that.

valadil
2013-11-15, 09:14 AM
Glad you said you're comfortable with linux. Unless there's some complexity I'm missing, a bash script should be fine.

Based on what you said I wouldn't worry about them being XML files. I'd make copies of the files and loop over them. Use `sed -i` and some regexes to replace a string in a file. Replace %%% (or whatever placeholder your template uses) with contents from the file.

This sort of approach works great for small scale replacements. If you need something a bit more involved, you'll need to post more details.
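
A rough sketch of that copy-and-replace loop. The template contents, the placeholder names `%%%SEQ1%%%` and `%%%SEQ2%%%`, and the hard-coded sequences are all invented for the example; real input would come from the .nex files.

```shell
tmp=$(mktemp -d)
# Made-up template with one placeholder per species.
cat > "$tmp/template.xml" <<'EOF'
<data>
  <sequence taxon="Species1" value="%%%SEQ1%%%"/>
  <sequence taxon="Species2" value="%%%SEQ2%%%"/>
</data>
EOF

# One copy of the template per input line; each line holds two stand-in sequences.
i=0
while read -r s1 s2; do
    i=$((i + 1))
    out="$tmp/run_$i.xml"
    cp "$tmp/template.xml" "$out"
    # -i.bak edits in place and keeps a backup; this spelling is accepted
    # by both GNU and BSD sed.
    sed -i.bak -e "s/%%%SEQ1%%%/$s1/" -e "s/%%%SEQ2%%%/$s2/" "$out"
    rm -f "$out.bak"
done <<'EOF'
AGTC TTGA
CCGG AATT
EOF

first=$(grep -c 'value="AGTC"' "$tmp/run_1.xml")
second=$(grep -c 'value="AATT"' "$tmp/run_2.xml")
echo "$first $second"
rm -rf "$tmp"
```

This only stays safe because DNA sequences contain nothing sed treats specially in a replacement (no `/`, `&` or backslash). Arbitrary text would need escaping first.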

TuggyNE
2013-11-15, 08:03 PM
Yeah, regexes are probably your best bet overall. They're designed for text processing, which this is.

GoblinArchmage
2013-11-16, 01:00 AM
Hm. Did you try reconfigurating the Giuseppe Plexus and optimizing the conflasmogostic flurcentosh? Doing that should alienate the clinkerclocks, which in turn will adjust the blinkersticks, allowing you to breach the brachiocodex. That ought to batch those XMLs quite nicely.

Jeff the Green
2013-11-16, 02:34 AM
Sigh. Yeah, I think that will work; I was just hoping it wouldn't be sed. I've tried off and on to figure it out and failed. Do you know of a good "sed for dummies" resource?


Yeah, but the solarplexus keeps re-codifying into a paleoarch. :smallamused:

valadil
2013-11-16, 09:06 PM
The best resource for sed is google and a terminal. I've been using it practically daily for at least 10 years. It's been long enough that I can't really remember where I started.

Are you having trouble with sed itself or regular expressions? If you're not familiar with either I can see why sed would be daunting.

There are two ways of running sed that I use all the time. -i edits a file in place. Without -i, sed edits a stream. (By stream I mean text that's going to be passed to your terminal. If you output some text and pipe it into sed, it will edit that stream of text.) Strictly speaking, -e just marks what follows as the command to run, and you can use it in either mode. Then you pass it a command in quotes. They look something like "s/abc/123/". The s command means substitute, i.e. replace. There are other commands but that's the one I mainly use sed for.
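
To make the difference between editing a stream and editing a file concrete, here is a small sketch. The temporary file is made up for the example, and `-i.bak` is used because that spelling is accepted by both GNU and BSD sed.

```shell
tmp=$(mktemp)
printf 'abc\n' > "$tmp"

# Stream use: the file is read, the edited text goes to stdout,
# and the file on disk is left untouched.
streamed=$(sed -e 's/abc/123/' "$tmp")
after_stream=$(cat "$tmp")

# In-place use: the file itself is rewritten (the .bak suffix keeps a backup).
sed -i.bak -e 's/abc/123/' "$tmp"
after_inplace=$(cat "$tmp")

echo "$streamed / $after_stream / $after_inplace"
rm -f "$tmp" "$tmp.bak"
```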

The first part between the slashes is a regular expression; the second is the replacement text. I didn't do any regex magic in there, so it just matches the literal string. Whatever the first part matches is replaced by the second. So...



$ echo 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/abc/123/'
123defghijklmnopqrstuvwxyz


You can also use the power of regex to make sets of things. Instead of matching a string of characters, you can tell it to match any one character from a set by wrapping that set in square brackets. For instance, '[aeiou]' will match vowels. Let's replace vowels with underscores.



$ echo 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/[aeiou]/_/'
_bcdefghijklmnopqrstuvwxyz


Wait. Why did that only match the a? Well, sed found that match and then it stopped because its job was done. You have to tell sed to keep replacing all instances of the match on each line. You can do that by adding the letter g after the final slash. There are other flags like this, but this is the one I use most.



$ echo 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/[aeiou]/_/g'
_bcd_fgh_jklmn_pqrst_vwxyz


Putting a caret (^) at the start of a bracketed set negates it, matching every character except those in the set.


$ echo 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/[^aeiou]/_/g'
a___e___i_____o_____u_____


Period is a magic character. It'll match *any* single character. Let's say we wanted sed to eat all vowels, and whatever letter comes after a vowel.


$ echo 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/[aeiou]./_/g'
_cd_gh_klmn_qrst_wxyz


Well, that doesn't look quite right because we replaced two letters with one. Let's use a double underscore instead.


$ echo 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/[aeiou]./__/g'
__cd__gh__klmn__qrst__wxyz


(Note, if you ever want to match an actual period, escape it. Put a backslash (\) before the period and it won't be a magic period anymore. That goes for other special regex characters too, and for the slashes delimiting the command.)
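
A quick sketch of what escaping changes, using made-up input strings:

```shell
# Unescaped, the dot matches any character, so the "x" in "1x2" is caught too.
loose=$(echo '1.2 1x2' | sed -e 's/1.2/_/g')
# Escaped, only a literal period matches.
strict=$(echo '1.2 1x2' | sed -e 's/1\.2/_/g')
# A literal slash must be escaped as well, because slashes delimit the command.
slash=$(echo 'a/b' | sed -e 's/\//_/')
echo "$loose | $strict | $slash"
```

This should print `_ _ | _ 1x2 | a_b`.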

You can match multiple characters with a *. Put one of these after something and that thing will match zero or more times, and it will keep matching for as long as it can, until it absolutely has to stop. I'm not sure that makes sense, so here's an example. This will replace everything from b through k with a single underscore.


$ echo 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/b.*k/_/g'
a_lmnopqrstuvwxyz
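
The "as long as it can" part only shows up when there is more than one place to stop. A small sketch with a made-up input that contains two k's:

```shell
# b.*k runs to the LAST k on the line, not the first.
greedy=$(echo 'abckdefkxyz' | sed -e 's/b.*k/_/')
# Excluding k from the repeated part makes the match stop at the first k.
short=$(echo 'abckdefkxyz' | sed -e 's/b[^k]*k/_/')
echo "$greedy $short"
```

This should print `a_xyz a_defkxyz`.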


That's all I can think of that I'd expect a basic sed tutorial to go through. I'm sure there's more though, so ask questions wherever you get stuck.

Jeff the Green
2013-11-16, 09:21 PM
Oh, I've used both for simple things. It's moving something from one file to another that I've had trouble with. Best I've been able to come up with is using sed to create a shell variable of the proper sequence in the .nex file and then sed again to replace the bit in the xml file with the shell variable. I think that'll work, though at the moment I don't have a terminal to work with (on a Chromebook). I'm a little hazy on how to get the sequence into a shell variable, though.

valadil
2013-11-17, 10:50 PM
Does the sequence exist in a file? You could do something like


VAR=$(grep 'the lines you want to match' "$FILE")
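
For the layout in the first post, where the sequence is on the line after the species name, a sketch of the whole round trip might look like this. The file contents and the `%%%SEQ2%%%` placeholder are made up, and it assumes each sequence fits on one line.

```shell
tmp=$(mktemp -d)
printf 'MATRIX\nSpecies1\nAGTCAGTC\n\nSpecies2\nTTGACCAA\n' > "$tmp/in.nex"
printf '%s\n' '<sequence taxon="Species2" value="%%%SEQ2%%%"/>' > "$tmp/out.xml"

FILE="$tmp/in.nex"
# -n suppresses the normal output; on the line that is exactly "Species2",
# n moves on to the next line and p prints it.
VAR=$(sed -n '/^Species2$/{n;p;}' "$FILE")

# Double quotes let the shell expand $VAR before sed sees the command.
result=$(sed -e "s/%%%SEQ2%%%/$VAR/" "$tmp/out.xml")
echo "$result"
rm -rf "$tmp"
```

Swapping the last sed for `sed -i.bak` would write the change back into the XML copy instead of printing it.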