Parsing non-standard table headers



  • There is a method that extracts the headers of a table.

    private List<String> headers(String html)
    {
        Document doc = Jsoup.parse(html);
        ArrayList<String> result = new ArrayList<>();
        Elements header;
        Element firstThead = doc.select("thead").first();
        Elements trOfFirstThead = firstThead.children();
        for (Element tr : firstThead.children())
        {
            Elements select = tr.select("th");
            for (Element th : select)
            {
                String s = th.attributes().get("rowspan");
                if (!s.isEmpty() && s.equals(String.valueOf(trOfFirstThead.size())))
                {
                    result.add(th.text());
                }
            }
        }
        header = trOfFirstThead.last().children();
    
        for (Element element : header)
        {
            if (element.tag().getName().equals(tag_th))
            {
                result.add(element.text());
            }
        }
        return result;
    }

    The gist of the method: the input contains a table with a thead section, and the headers must be returned as a collection of strings. If the headers span several rows, the cells that end up in the bottom row should be selected and their names taken.

    This algorithm works for tables 1, 2 and 3 (see the appendix below), but it gives the wrong result for a table of type 4.

    Required collection:

    h4 h10 h11 h12 h6 h7

    The collection the algorithm actually produces:

    h4 h7 h10 h11 h12

    Please suggest advice or an algorithm for achieving the required behaviour.

    P.S. Source code of the tables referenced above.

    <html>
        <head>
         <meta http-equiv="content-type" content="text/html; charset=utf-8">
        </head>
        <body>
        <div>
         <ul>
          <li>
           <div>
            <p>Table 1</p>
            <table border="1">
             <thead>
             <tr>
              <th>h1</th>
              <th>h2</th>
              <th>h3</th>
             </tr>
             </thead>
             <tbody>
             <tr>
              <td>1</td>
              <td>2</td>
              <td>3</td>
             </tr>
             <tr>
              <td>4</td>
              <td>5</td>
              <td>6</td>
             </tr>
             <tr>
              <td>7</td>
              <td>8</td>
              <td>9</td>
             </tr>
             </tbody>
            </table>
           </div>
          </li>
    
          <li>
           <div>
            <p>Table 2</p>
            <table border="1">
             <thead>
             <tr>
              <th colspan="2">h4</th>
              <th>h5</th>
             </tr>
             <tr>
              <th>h1</th>
              <th>h2</th>
              <th>h3</th>
             </tr>
             </thead>
             <tbody>
             <tr>
              <td>1</td>
              <td>2</td>
              <td>3</td>
             </tr>
             <tr>
              <td>4</td>
              <td>5</td>
              <td>6</td>
             </tr>
             <tr>
              <td>7</td>
              <td>8</td>
              <td>9</td>
             </tr>
             </tbody>
            </table>
           </div>
          </li>

          <li>
           <div>
            <p>Table 3</p>
            <table border="1">
             <thead>
             <tr>
              <th rowspan="2">h4</th>
              <th colspan="2">h5</th>
             </tr>
             <tr>
              <th>h1</th>
              <th>h2</th>
             </tr>
             </thead>
             <tbody>
             <tr>
              <td>1</td>
              <td>2</td>
              <td>3</td>
             </tr>
             <tr>
              <td>4</td>
              <td>5</td>
              <td>6</td>
             </tr>
             <tr>
              <td>7</td>
              <td>8</td>
              <td>9</td>
             </tr>
             </tbody>
            </table>
           </div>
          </li>

          <li>
           <div>
            <p>Table 4</p>
            <table border="1">
             <thead>
             <tr>
              <th rowspan="3">h4</th>
              <th colspan="4">h5</th>
              <th rowspan="3">h7</th>
             </tr>
             <tr>
              <th>h1</th>
              <th>h2</th>
              <th>h8</th>
              <th rowspan="2">h6</th>
             </tr>
             <tr>
              <th>h10</th>
              <th>h11</th>
              <th>h12</th>
             </tr>
             </thead>
             <tbody>
             <tr>
              <td>1</td>
              <td>2</td>
              <td>3</td>
              <td>4</td>
              <td>5</td>
              <td>6</td>
             </tr>
             <tr>
              <td>7</td>
              <td>8</td>
              <td>9</td>
              <td>10</td>
              <td>11</td>
              <td>12</td>
             </tr>
             <tr>
             </tr>
             </tbody>
            </table>
           </div>
          </li>
         </ul>
        </div>
        </body>
    </html>


  • I think the solution is to implement, at least in part, the HTML5 table-forming algorithm: http://www.w3.org/TR/html5/tabular-data.html#forming-a-table. Handling <thead> alone is fairly simple. This code gives the correct result on your examples:

    // Requires: java.util.Arrays, java.util.List, java.util.stream.Collectors,
    //           org.jsoup.nodes.Element, org.jsoup.select.Elements
    static class TableHeader {
        private String[][] cells;
        private int y_height = 0;
        private int x_width = 0;

        public TableHeader( int rows, int columns, Element thead ) {
            cells = new String[rows][columns];

            parseTHead( thead );
        }

        // Grows the slot grid so that it holds at least rows x columns slots.
        private void ensureCapacity( int rows, int columns ) {
            if ( rows <= cells.length && columns <= cells[0].length ) return;

            int nRows = Math.max( cells.length, rows );
            int nColumns = Math.max( cells[0].length, columns );

            String[][] newCells = new String[nRows][nColumns];
            for ( int row = 0; row < cells.length; row++ ) {
                System.arraycopy( cells[row], 0, newCells[row], 0, cells[row].length );
            }

            cells = newCells;
        }

        // Writes the cell text into every slot covered by the cell.
        private void fill( String cellValue, int row, int col, int rowspan, int colspan ) {
            ensureCapacity( row + rowspan, col + colspan );
            for ( int r = 0; r < rowspan; r++ ) {
                for ( int c = 0; c < colspan; c++ ) {
                    cells[row + r][col + c] = cellValue;
                }
            }
        }

        // Reads rowspan/colspan; a missing or non-numeric value counts as 1.
        private int cellSpan( Element th, String attrName ) {
            String attrValue = th.attr( attrName );
            int result = 1;
            if ( attrValue.isEmpty() ) return result;
            try {
                result = Integer.parseInt( attrValue );
            } catch ( NumberFormatException ex ) { /*ignore*/ }
            return result;
        }

        // http://www.w3.org/TR/html5/tabular-data.html#algorithm-for-processing-row-groups
        private void parseTHead( Element thead ) {
            //int y_start = y_height; // #1
            int y_current = 0;
            final Elements rows = thead.children().select( "tr" );
            final int rowsNumber = rows.size();
            ensureCapacity( rowsNumber, x_width );
            for ( Element tr : rows ) { // #2
                //http://www.w3.org/TR/html5/tabular-data.html#algorithm-for-processing-rows
                if ( y_height == y_current ) {
                    y_height += 1;
                }
                int x_current = 0;
                //TODO: Run the algorithm for growing 'downward-growing cells'.
                for ( Element currentCell : tr.children().select( "td, th" ) ) {
                    //6. While xcurrent is less than xwidth and the slot with coordinate (xcurrent, ycurrent)
                    //  already has a cell assigned to it, increase xcurrent by 1.
                    while ( x_current < x_width && cells[y_current][x_current] != null ) x_current += 1;
                    if ( x_current == x_width ) {
                        x_width += 1; //# 7
                    }
                    int colspan = cellSpan( currentCell, "colspan" ); //#8
                    int rowspan = cellSpan( currentCell, "rowspan" ); //#9
                    if ( colspan == 0 ) colspan = 1;
                    //TODO: 10. If rowspan is zero and the table element's Document is not set to quirks mode,
                    // then let 'cell grows downward' be true, and set rowspan to 1.
                    // Otherwise, let cell grows downward be false.
                    //FIXME: do not let rowspan create more rows than there are <tr> elements;
                    //  how does the standard resolve this?
                    rowspan = Math.min( rowsNumber - y_current, rowspan );
                    if ( x_width < x_current + colspan ) x_width = x_current + colspan;
                    if ( y_height < y_current + rowspan ) y_height = y_current + rowspan;
                    // TODO: If any of the slots involved already had a cell covering them,
                    //   then this is a table model error.
                    //   Those slots now have two cells overlapping.
                    fill( currentCell.text(), y_current, x_current, rowspan, colspan ); // #13
                    // TODO: If 'cell grows downward' is true, then add the tuple
                    //   {c, xcurrent, colspan} to the list of 'downward-growing cells'.
                    x_current += colspan; //#15
                }
                y_current += 1;
            }
        }

        // Texts of the cells covering the bottom header row, left to right.
        public List<String> lastRow() {
            return Arrays.stream( cells[y_height - 1] ).limit( x_width ).collect( Collectors.toList() );
        }
    }

    private static List<String> headers3(String html) {
        Document doc = Jsoup.parse(html);

        Element firstThead = doc.select("thead").first();

        TableHeader header = new TableHeader(10, 10, firstThead);

        return header.lastRow();
    }

    The rowspan="0" case is not handled, and all the manipulations with width and height could most likely be folded into fill. I have not tested this on anything but your examples. As a bonus, this approach makes it easy to obtain the complete header of a column.
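    To illustrate the "complete column header" remark, here is a minimal sketch that is not part of the code above. It assumes a rectangular slot grid laid out like the cells array that fill builds (one text per slot, null for an empty slot). The class ColumnPaths, the method fullHeaders and the separator parameter are hypothetical names. Because the grid keeps only text, two different stacked cells with the same text would be merged into one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the TableHeader class above.
final class ColumnPaths {
    // grid[row][col] holds the text of the header cell covering that slot;
    // null marks a slot that no cell covers. The grid is assumed rectangular.
    static List<String> fullHeaders(String[][] grid, String separator) {
        List<String> result = new ArrayList<>();
        if (grid.length == 0) return result;
        for (int col = 0; col < grid[0].length; col++) {
            StringBuilder path = new StringBuilder();
            String previous = null;
            for (String[] row : grid) {
                String text = row[col];
                // A cell with rowspan > 1 covers several slots of the column:
                // skip repeats of the text just added, and skip empty slots.
                if (text == null || text.equals(previous)) continue;
                if (path.length() > 0) path.append(separator);
                path.append(text);
                previous = text;
            }
            result.add(path.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Slot grid of table 4 from the question, written out by hand.
        String[][] table4 = {
            {"h4", "h5", "h5", "h5", "h5", "h7"},
            {"h4", "h1", "h2", "h8", "h6", "h7"},
            {"h4", "h10", "h11", "h12", "h6", "h7"},
        };
        System.out.println(fullHeaders(table4, " / "));
    }
}
```

    For table 4 this prints [h4, h5 / h1 / h10, h5 / h2 / h11, h5 / h8 / h12, h5 / h6, h7]. To use it with TableHeader, the class would need an accessor exposing the first y_height rows and x_width columns of cells, which the original code does not have.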

    Update: there is an obvious problem when y_current + rowspan exceeds the number of <tr> elements: fill then creates extra rows, which is not what the browser shows. I am sure the situation is the same with colspan. For now I simply clamp rowspan from above, but I clearly have not fully understood the standard on this point.
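    The clamping, together with the unhandled rowspan="0" case, could be gathered in one place. The sketch below is only one possible reading, not a confirmed interpretation of the standard. It assumes that a cell never extends past the last <tr> of its row group (the behaviour observed in the browser), and it treats rowspan="0" as "all remaining rows of the group", which is how the HTML specification describes that value. SpanRules and effectiveRowspan are hypothetical names.

```java
// Hypothetical helper; the clipping rule is an assumption based on observed
// browser behaviour, not something the code above or the standard confirms.
final class SpanRules {
    // declared:    value parsed from the rowspan attribute (1 if absent)
    // rowIndex:    zero-based index of the cell's <tr> inside the row group
    // rowsInGroup: number of <tr> elements in the row group
    static int effectiveRowspan(int declared, int rowIndex, int rowsInGroup) {
        int remaining = rowsInGroup - rowIndex;
        if (declared == 0) {
            // rowspan="0": the cell spans every remaining row of the group.
            return remaining;
        }
        // Treat negative values as 1, then clip to the rows that exist.
        return Math.min(Math.max(declared, 1), remaining);
    }

    public static void main(String[] args) {
        System.out.println(effectiveRowspan(3, 0, 3)); // fits exactly
        System.out.println(effectiveRowspan(0, 1, 3)); // grows to the end of the group
        System.out.println(effectiveRowspan(5, 2, 3)); // clipped to the last row
    }
}
```

    In parseTHead such a helper would replace the Math.min line. Note that with the code as posted, rowspan="0" is parsed as 0 and the cell fills no slots at all.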



