Consider Get-PSHTMLDocument #250

Open
Stephanevg opened this issue Jul 12, 2019 · 4 comments
Labels
design discussion enhancement New feature or request help wanted Extra attention is needed

@Stephanevg
Owner

It would be nice to have a function that reads an HTML page and returns an object, which could then be developed further or even converted to a PSHTML PowerShell file (is that utopian?).

  1. The parsing

For that, we will need the ability to parse an HTML document.

This snippet might be an option to do so:

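Add-Type -AssemblyName System.Xml.Linq
# Read the whole file as one string
$txt = [IO.File]::ReadAllText("c:\myhtml.html")
# Note: XDocument.Parse only accepts well-formed XML (XHTML)
$xml = [System.Xml.Linq.XDocument]::Parse($txt)
# XHTML elements live in this namespace
$ns = 'http://www.w3.org/1999/xhtml'
# Select every <td> element in the document
$cells = $xml.Descendants("{$ns}td")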
  2. Create a PSHTML.Document object
    Once the document is parsed (or while parsing), we could create the corresponding PSHTML object for each HTML element.
    This assumes that the following issue is implemented and closed first -> Create core PSHTML object (PSHTML.Document) #218
@Stephanevg Stephanevg added design discussion enhancement New feature or request help wanted Extra attention is needed labels Jul 12, 2019
@Stephanevg Stephanevg added this to the 0.9.0 milestone Jul 20, 2019
@LxLeChat
Contributor

Hi,
I think it's better to use something really dedicated to HTML. System.Xml is XML-focused. I tried it with the small function I created in #218, and it fails when some "special" HTML syntaxes are used (atom stuff...).

I tried the HTML Agility Pack and, well, it is HTML-oriented, and using it is almost the same. It also works on PowerShell Core (6.2).
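
To illustrate the XML limitation mentioned above, here is a minimal sketch (the sample markup is made up for this example, it is not taken from the page that was tested). An unclosed void element such as <br> is valid HTML but not well-formed XML, so XDocument.Parse is expected to throw:

```powershell
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical sample: valid HTML, but <br> is never closed, so it is not well-formed XML
$sample = '<html><body><p>line one<br>line two</p></body></html>'

$parsed = $true
try {
    # XDocument.Parse requires every opening tag to have a matching closing tag
    [System.Xml.Linq.XDocument]::Parse($sample) | Out-Null
}
catch {
    # The XML parser rejects the document because of the mismatched <br>/<p> tags
    $parsed = $false
    Write-Host "XML parsing failed: $($_.Exception.Message)"
}
```

An HTML-aware parser tolerates this kind of markup, which is the main argument for not building Get-PSHTMLDocument on System.Xml.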

it's available here: https://html-agility-pack.net
(download the nuget package, and unzip it somewhere )

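# Load the HTML Agility Pack assembly from the unzipped NuGet package
[Reflection.Assembly]::LoadFrom("C:\Users\Lx\Downloads\htmlagilitypack.1.11.12\lib\Net45\HtmlAgilityPack.dll")
$html = New-Object -TypeName HtmlAgilityPack.HtmlDocument
# $a must already contain the HTML source
$html.LoadHtml($a)
# Root node of the parsed document
$html.DocumentNode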

@LxLeChat
Contributor

Here is a working example using the HTML Agility Pack and core PSHTML classes like those in #218.
First, load the HTML Agility Pack: [Reflection.Assembly]::LoadFrom("C:\Users\Lx\Downloads\htmlagilitypack.1.11.12\lib\Net45\HtmlAgilityPack.dll")

Then get the HTML source of your favorite page and copy/paste it into an HTML file.
Read the content as a single string (-Raw avoids handing LoadHtml an array of lines): $a = Get-Content .\yourhtmlpage.html -Raw

And voilà:

PS C:\Users\Lx> $x = get-pshtmldocument -html $a
PS C:\Users\Lx> $x

TagName  id Class Children
-------  -- ----- --------
                  {$null}
#comment          {}
html              {, }    


PS C:\Users\Lx> $x[2]

TagName id Class Children

PS C:\Users\Lx> $x[2].children[1].children

TagName  id         Class                                                        Children
-------  --         -----                                                        --------
script                                                                           {}
script                                                                           {var config = {     autoCapture: {             lineage: true     }... 
noscript                                                                         {}
div      headerArea uhf                                                          {headerRegion}
link                                                                             {}
link                                                                             {}
script                                                                           {}
div      page       hfeed site                                                   {single-wrapper, wrapper-footer}
div                 a2a_kit a2a_kit_size_32 a2a_floating_style a2a_default_style {, , }
script                                                                           {var CrayonSyntaxSettings = {"version":"_2.7.2_beta","is_admin":"0... 
script                                                                           {(function (undefined) {var _targetWindow ="prefer-popup"; window.... 
script                                                                           {/*{literal}*/window.lightningjs||function(c){function g(b,d){d&&(... 
div      footerArea uhf                                                          {footerRegion}
link                                                                             {}
link                                                                             {}
script                                                                           {}
script                                                                           {//fix calendar hide when change month        var string = window.... 
script                                                                           {}
script                                                                           {window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","... 


PS C:\Users\Lx>

The function itself:

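# Converts an HTML string into a tree of htmlParentElement objects (class from #218).
# HtmlAgilityPack.dll must be loaded before calling this function.
function get-pshtmldocument {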
    param (
        $html
    )

    begin {

       function HtmlToPSHTMLClass {
            param(
                $node
            )

            If ( $node.nodetype -ne 'Text' ) {

                $plop = [htmlParentElement]::New()
                $plop.SetTagName($node.Name)
                $plop.Id = $node.Attributes.where({$_.name -eq 'id'}).Value
                $plop.Class = $node.Attributes.where({$_.name -eq 'class'}).Value

                If ( $node.hasChildNodes ) { 
                    foreach ( $n in $node.childnodes ) {
                        # Some text nodes are 'empty' (whitespace only), so skip them... maybe a bug?
                        If ( $n.nodetype -eq 'Text' -and $n.InnerText.trim() -ne '' ) {
                            $child = $n.InnerText
                            $plop.AddChild( $child )
                        } elseif ( $n.nodetype -ne 'Text') {
                            $child = HtmlToPSHTMLClass -node $n
                            $plop.AddChild( $child )
                        }
                    }
                }
            }

            $plop
        } 

    }

    process {

        $document = New-Object -TypeName HtmlAgilityPack.HtmlDocument
        $document.LoadHtml($html)

        Foreach( $node in $document.DocumentNode.ChildNodes ) {
            HtmlToPSHTMLClass -node $node
        }

    }

    end {

    }

}
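
The function above depends on the htmlParentElement class from #218, which is not shown in this thread. To try the function in isolation, a minimal stand-in could look like the sketch below. This is an assumption, not the actual class from #218: only the members the function calls (SetTagName, AddChild, Id, Class) and the columns visible in the output above (TagName, Children) are modelled.

```powershell
# Hypothetical stand-in for the htmlParentElement class from #218 (not the real implementation)
class htmlParentElement {
    [string]$TagName
    [string]$Id
    [string]$Class
    # Children can be nested elements or plain text strings
    [System.Collections.Generic.List[object]]$Children = [System.Collections.Generic.List[object]]::new()

    [void] SetTagName([string]$Name) {
        $this.TagName = $Name
    }

    [void] AddChild([object]$Child) {
        $this.Children.Add($Child)
    }
}

# Build a tiny tree by hand, mirroring what HtmlToPSHTMLClass does for each node
$div = [htmlParentElement]::new()
$div.SetTagName('div')
$div.Id = 'page'

$p = [htmlParentElement]::new()
$p.SetTagName('p')
$p.AddChild('Hello')   # text nodes are stored as plain strings

$div.AddChild($p)
```

With the stand-in defined before get-pshtmldocument and the HTML Agility Pack loaded, the function should produce objects shaped like the ones in the output above.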

@Stephanevg
Owner Author

A side note: the HTML Agility Pack (HAP) is MIT licensed, so we could strongly consider it.

@Stephanevg
Owner Author

Another side note: it looks like Justin Grote has already written a PowerShell implementation of the Agility Pack:
PowerHTML (under MIT as well).
