java - Extract only the text from entity-encoded HTML inside an XML element


I'm working on an XML-parsing Android application and I have a problem. The page I'm going to parse has a <description> element that contains entity-encoded HTML I don't want. This is the structure:

<description>&lt;img src="http://images.website.it/thumbs/images/2014/12/16/asd_crop_upscale_q85.jpg" alt="post img " style="float:left;margin-right:10px"/&gt; lorem ipsum...</description> 

What I want is the lorem ipsum... part, and none of the encoded HTML inside the <description> tag. This is the doInBackground part of my AsyncTask:

    @Override
    protected Void doInBackground(Void... args) {
        Element e = null;
        XMLParser parser = new XMLParser();
        String xml = parser.getXmlFromUrl(url); // getting XML
        Document doc = parser.getDomElement(xml); // getting DOM element
        NodeList nl = doc.getElementsByTagName(KEY_ITEM);

        // looping through all item nodes <item>
        for (int i = 0; i < nl.getLength(); i++) {
            // creating new HashMap
            HashMap<String, String> map = new HashMap<String, String>();
            e = (Element) nl.item(i);
            // adding each child node to HashMap key => value
            map.put(KEY_TITLE, parser.getValue(e, KEY_TITLE));
            map.put(KEY_DESC, parser.getValue(e, KEY_DESC));
            map.put(KEY_DATE, parser.getValue(e, KEY_DATE));
            map.put(KEY_LINK, parser.getValue(e, KEY_LINK));

            String descriptionElementContent = KEY_DESC;
            textOnly = removeEntityEncodedHtmlTags(descriptionElementContent);

            // adding HashList to ArrayList
            menuItems.add(map);
        }
        System.out.println("prova3");
        for (int c = 0; c < nl.getLength(); c++) {
            e = (Element) nl.item(c);
            titoli[c] = parser.getValue(e, KEY_TITLE); // name child value
            descrizioni[c] = parser.getValue(e, textOnly);
            date[c] = parser.getValue(e, KEY_DATE);
            links[c] = parser.getValue(e, KEY_LINK);
        }
        return null;
    }

    public String removeEntityEncodedHtmlTags(String rawString) {
        Matcher tagMatcher = ENTITY_ENCODED_HTML_TAG.matcher(rawString);
        return tagMatcher.replaceAll("").trim();
    }

    @Override
    protected void onPostExecute(Void s) {
        adapter = new SimpleAdapter(getActivity(), menuItems, R.layout.post_list_item,
                new String[] { KEY_TITLE, textOnly, KEY_DATE },
                new int[] { R.id.title, R.id.description, R.id.date });
        System.out.println("prova5");
        listView.setAdapter(adapter);
        pDialog.dismiss();
    }

It's the first time I've parsed XML, so I don't know how this can be done.

This is my XMLParser class, by the way:

    public class XMLParser {

        // constructor
        public XMLParser() {
        }

        /**
         * Getting XML from URL by making an HTTP request
         *
         * @param url string
         */
        public String getXmlFromUrl(String url) {
            String xml = null;

            try {
                // DefaultHttpClient
                DefaultHttpClient httpClient = new DefaultHttpClient();
                HttpPost httpPost = new HttpPost(url);

                HttpResponse httpResponse = httpClient.execute(httpPost);
                HttpEntity httpEntity = httpResponse.getEntity();
                xml = EntityUtils.toString(httpEntity);

            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            } catch (ClientProtocolException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            // return XML
            return xml;
        }

        /**
         * Getting XML DOM element
         *
         * @param xml string
         */
        public Document getDomElement(String xml) {
            Document doc = null;
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            try {

                DocumentBuilder db = dbf.newDocumentBuilder();

                InputSource is = new InputSource();
                is.setCharacterStream(new StringReader(xml));
                doc = db.parse(is);

            } catch (ParserConfigurationException e) {
                Log.e("Error: ", e.getMessage());
                return null;
            } catch (SAXException e) {
                Log.e("Error: ", e.getMessage());
                return null;
            } catch (IOException e) {
                Log.e("Error: ", e.getMessage());
                return null;
            }

            return doc;
        }

        /**
         * Getting node value
         *
         * @param elem element
         */
        public final String getElementValue(Node elem) {
            Node child;
            if (elem != null) {
                if (elem.hasChildNodes()) {
                    for (child = elem.getFirstChild(); child != null;
                            child = child.getNextSibling()) {
                        if (child.getNodeType() == Node.TEXT_NODE) {
                            return child.getNodeValue();
                        }
                    }
                }
            }
            return "";
        }

        /**
         * Getting node value
         *
         * @param Element node
         * @param key string
         */
        public String getValue(Element item, String str) {
            NodeList n = item.getElementsByTagName(str);
            return this.getElementValue(n.item(0));
        }
    }

I edited the question with the new doInBackground method. Nothing changed.

If you want to discard all of the entity-encoded HTML found within the description element, you can use a regular expression to find all the encoded HTML tags and replace them with an empty string, then trim the resulting string to get rid of any unwanted leading and trailing spaces.

You can do this using the String.replaceAll(String regex, String replacement) method, but if you're going to do it more than once it's worth creating a Pattern object once and reusing it each time. This avoids the Java runtime having to compile the same regular expression more than once.
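To make the difference concrete, here is a minimal sketch of both forms side by side. The class name PrecompiledPatternDemo and the method names are hypothetical, chosen only for illustration; both methods are expected to return the same text, and only the second avoids recompiling the regex on every call.

```java
import java.util.regex.Pattern;

class PrecompiledPatternDemo {
    // Compiled once when the class is loaded, then reused for every call.
    private static final Pattern ENTITY_ENCODED_HTML_TAG =
            Pattern.compile("&lt;.*?&gt;");

    // One-off form: String.replaceAll compiles the regex on each call.
    static String stripOnce(String raw) {
        return raw.replaceAll("&lt;.*?&gt;", "").trim();
    }

    // Reusable form: only a new Matcher is created per call.
    static String stripReusingPattern(String raw) {
        return ENTITY_ENCODED_HTML_TAG.matcher(raw).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String raw = "&lt;b&gt;lorem&lt;/b&gt; ipsum";
        System.out.println(stripOnce(raw));
        System.out.println(stripReusingPattern(raw));
    }
}
```

For a handful of strings the one-off form is fine; the precompiled form pays off when the method runs for every item in a feed.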

I've tested this code and it works against the example XML in the question:

    private static final Pattern ENTITY_ENCODED_HTML_TAG =
            Pattern.compile("&lt;.*?&gt;");

    public static void main(String[] args) {
        String descriptionElementContent =
                "&lt;img src=\"http://images.website.it/thumbs/images/2014/12/16/asd_crop_upscale_q85.jpg\" alt=\"post img \" style=\"float:left;margin-right:10px\"/&gt; lorem ipsum... &lt;br /&gt;";
        String textOnly = removeEntityEncodedHtmlTags(
                descriptionElementContent);
        System.out.println(textOnly);
    }

    public static String removeEntityEncodedHtmlTags(String rawString) {
        Matcher tagMatcher = ENTITY_ENCODED_HTML_TAG.matcher(rawString);
        return tagMatcher.replaceAll("").trim();
    }

Simply replace the main method in the above code with your own loop through the description elements.
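One way such a loop could look is sketched below. It assumes a plain javax.xml DOM parse instead of the XMLParser class from the question, and the names DescriptionTextExtractor, removeHtmlTags and extractDescriptions are hypothetical. Two points are worth noting: the cleaner has to receive the element's content rather than the tag name, and an XML parser normally decodes &lt; and &gt; to < and > before returning the element text, so this sketch extends the pattern to accept both forms.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DescriptionTextExtractor {
    // Assumption: a tag may arrive entity-encoded (&lt;...&gt;)
    // or already decoded by the XML parser (<...>).
    private static final Pattern HTML_TAG =
            Pattern.compile("&lt;.*?&gt;|<[^>]*>");

    static String removeHtmlTags(String rawString) {
        return HTML_TAG.matcher(rawString).replaceAll("").trim();
    }

    // Returns the tag-free text of every <description> element.
    static List<String> extractDescriptions(String xml) {
        try {
            DocumentBuilder db =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("description");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Pass the element's content, not the tag name.
                result.add(removeHtmlTags(nodes.item(i).getTextContent()));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss><item><description>"
                + "&lt;img src=\"a.jpg\" alt=\"post img\"/&gt; lorem ipsum..."
                + "</description></item></rss>";
        System.out.println(extractDescriptions(xml));
    }
}
```

In the question's doInBackground, the equivalent change would be to clean the value returned by parser.getValue(e, KEY_DESC) and store that result in the map under KEY_DESC.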

The regex pattern &lt;.*?&gt; says "match the sequence &lt; and any content until the first occurrence of &gt;". The pattern is used to instantiate a single static Pattern object, which can be used (over and over again) to create a Matcher object for each raw string you pass to the method. The Matcher.replaceAll method replaces every match found (in the raw string) with an empty string to get rid of it, leaving you with the text only.
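The "until the first occurrence" part comes from the reluctant quantifier .*? in the pattern. A small sketch comparing it with the greedy .* shows why it matters; the class name ReluctantQuantifierDemo and its members are hypothetical.

```java
import java.util.regex.Pattern;

class ReluctantQuantifierDemo {
    // Reluctant: each match stops at the first &gt; after a &lt;.
    static final Pattern RELUCTANT = Pattern.compile("&lt;.*?&gt;");
    // Greedy: a match runs from the first &lt; to the last &gt;.
    static final Pattern GREEDY = Pattern.compile("&lt;.*&gt;");

    static String strip(Pattern pattern, String raw) {
        return pattern.matcher(raw).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String raw = "&lt;b&gt;lorem&lt;/b&gt; ipsum";
        System.out.println(strip(RELUCTANT, raw)); // prints "lorem ipsum"
        System.out.println(strip(GREEDY, raw));    // prints "ipsum"
    }
}
```

With the greedy form, the text between two tags is swallowed along with the tags, which is why the reluctant form is the one to use here.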

