Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

typeclassed String to StringLike #82

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -10,3 +10,6 @@ cabal-dev
.stack-work/
cabal.sandbox.config
cabal.config
*.swp
*.swo
*.DS_Store
1 change: 1 addition & 0 deletions .travis/stack-lts-2.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -8,4 +8,5 @@ extra-deps:
- http-client-0.4.30
- http-client-tls-0.2.4
- pointedlist-0.6.1
- tagsoup-0.14.8@sha256:56b2023d2e9fdbff093719ce9af5285d2436b234c6c684d6a69f14595a8348ae
resolver: lts-2.0
1 change: 1 addition & 0 deletions .travis/stack-lts-6.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,4 +6,5 @@ packages:
extra-deps:
- fail-4.9.0.0
- http-client-0.4.30
- tagsoup-0.14.8@sha256:56b2023d2e9fdbff093719ce9af5285d2436b234c6c684d6a69f14595a8348ae
resolver: lts-6.0
1 change: 1 addition & 0 deletions .travis/stack-lts-7.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,4 +5,5 @@ packages:
- ../examples/
extra-deps:
- fail-4.9.0.0
- tagsoup-0.14.8@sha256:56b2023d2e9fdbff093719ce9af5285d2436b234c6c684d6a69f14595a8348ae
resolver: lts-7.4
5 changes: 4 additions & 1 deletion examples/complex-predicates/Main.hs
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ExtendedDefaultRules #-}

import Text.HTML.Scalpel
import Control.Applicative
import Control.Monad
import Data.List (isInfixOf)

default (String)


exampleHtml :: String
exampleHtml = "<html>\
Expand Down Expand Up @@ -37,7 +40,7 @@ catComment :: Scraper String String
catComment =
-- 1. First narrow the current context to the div containing the comment's
-- textual content.
chroot ("div" @: [hasClass "comment", hasClass "text"]) $ do
chroot (("div" :: TagName String) @: [hasClass "comment", hasClass "text"]) $ do
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is why I chose to mix String and StringLike. It seems a lot less convenient to have to explicitly type each tag name as a TagName String.

Is there a specific benefit that using StringLike enables?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have a String or Text specific file that just declares the type

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmmm, there's already Text.HTML.Scalpel and Text.HTML.Scalpel.Core. I'd rather not create more re-exports of the API if its not necessary.

Other than API consistency is there a benefit to having StringLike, Text, and String versions of the API? I'm not super familiar with the implementation of OverloadedStrings, but I think the conversion happens at run time. If that's the case I don't think there would be a performance difference since internally, the library is already converting to Text for comparing nodes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using libs that use Text and it's inconvenient to convert to strings and from strings.

What if we have a classed based file Text.HTML.Scalpel.StringLike and then just apply String to all the functions in Text.HTML.Scalpel?

Copy link
Owner

@fimad fimad May 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, I was not really expecting people to be passing dynamic values as attributes. I think having a single Text.HTML.Scalpel.StringLike is a bit more palatable, let's go with that.

-- 2. Any can be used to access the root tag of the current context.
contents <- text anySelector
-- 3. Skip comment divs that do not contain "cat".
Expand Down
13 changes: 8 additions & 5 deletions examples/example-from-documentation/Main.hs
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ExtendedDefaultRules #-}

import Text.HTML.Scalpel
import Control.Applicative

default (String)


exampleHtml :: String
exampleHtml = "<html>\
Expand Down Expand Up @@ -35,19 +38,19 @@ main :: IO ()
main = print $ scrapeStringLike exampleHtml comments
where
comments :: Scraper String [Comment]
comments = chroots ("div" @: [hasClass "container"]) comment
comments = chroots (("div" :: TagName String) @: [hasClass "container"]) comment

comment :: Scraper String Comment
comment = textComment <|> imageComment

textComment :: Scraper String Comment
textComment = do
author <- text $ "span" @: [hasClass "author"]
commentText <- text $ "div" @: [hasClass "text"]
author <- text $ ("span" :: TagName String) @: [hasClass "author"]
commentText <- text $ ("div" :: TagName String) @: [hasClass "text"]
return $ TextComment author commentText

imageComment :: Scraper String Comment
imageComment = do
author <- text $ "span" @: [hasClass "author"]
imageURL <- attr "src" $ "img" @: [hasClass "image"]
author <- text $ ("span" :: TagName String) @: [hasClass "author"]
imageURL <- attr "src" $ ("img" :: TagName String) @: [hasClass "image"]
return $ ImageComment author imageURL
1 change: 1 addition & 0 deletions examples/html-to-markdown/Main.hs
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
-- the rough markdown equivalent using serial scrapers.

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes #-}
import Control.Applicative
import System.Environment
import Text.HTML.Scalpel
Expand Down
3 changes: 3 additions & 0 deletions scalpel-core/benchmarks/Main.hs
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ExtendedDefaultRules #-}

import Text.HTML.Scalpel.Core

Expand All @@ -9,6 +10,8 @@ import Data.Foldable (foldr')
import qualified Data.Text as T
import qualified Text.HTML.TagSoup as TagSoup

default (String, Int)


main :: IO ()
main = do
Expand Down
2 changes: 1 addition & 1 deletion scalpel-core/scalpel-core.cabal
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ library
, pointedlist
, regex-base
, regex-tdfa
, tagsoup >= 0.12.2
, tagsoup >= 0.14.8
, text
, vector
default-extensions:
Expand Down
45 changes: 23 additions & 22 deletions scalpel-core/src/Text/HTML/Scalpel/Internal/Scrape.hs
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import Text.HTML.Scalpel.Internal.Select.Types
import Control.Applicative
import Control.Monad
import Data.Maybe
import Text.StringLike (StringLike, castString)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I never really liked that StringLike was a part of the TagSoup package.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think StringLike is an abomination as well


import qualified Control.Monad.Fail as Fail
import qualified Data.Vector as Vector
Expand Down Expand Up @@ -67,7 +68,7 @@ instance Fail.MonadFail (Scraper str) where

-- | The 'scrape' function executes a 'Scraper' on a list of
-- 'TagSoup.Tag's and produces an optional value.
scrape :: (TagSoup.StringLike str)
scrape :: (StringLike str)
=> Scraper str a -> [TagSoup.Tag str] -> Maybe a
scrape s = scrapeTagSpec s . tagsToSpec . TagSoup.canonicalizeTags

Expand All @@ -77,7 +78,7 @@ scrape s = scrapeTagSpec s . tagsToSpec . TagSoup.canonicalizeTags
--
-- This function will match only the first set of tags matching the selector, to
-- match every set of tags, use 'chroots'.
chroot :: (TagSoup.StringLike str)
chroot :: (StringLike str)
=> Selector -> Scraper str a -> Scraper str a
chroot selector inner = do
maybeResult <- listToMaybe <$> chroots selector inner
Expand All @@ -91,30 +92,30 @@ chroot selector inner = do
--
-- > s = "<div><div>A</div></div>"
-- > scrapeStringLike s (chroots "div" (pure 0)) == Just [0, 0]
chroots :: (TagSoup.StringLike str)
chroots :: (StringLike str)
=> Selector -> Scraper str a -> Scraper str [a]
chroots selector (MkScraper inner) = MkScraper
$ return . mapMaybe inner . select selector

-- | The 'matches' function takes a selector and returns `()` if the selector
-- matches any node in the DOM.
matches :: (TagSoup.StringLike str) => Selector -> Scraper str ()
matches :: (StringLike str) => Selector -> Scraper str ()
matches s = MkScraper $ withHead (pure ()) . select s

-- | The 'text' function takes a selector and returns the inner text from the
-- set of tags described by the given selector.
--
-- This function will match only the first set of tags matching the selector, to
-- match every set of tags, use 'texts'.
text :: (TagSoup.StringLike str) => Selector -> Scraper str str
text :: (StringLike str) => Selector -> Scraper str str
text s = MkScraper $ withHead tagsToText . select s

-- | The 'texts' function takes a selector and returns the inner text from every
-- set of tags (possibly nested) matching the given selector.
--
-- > s = "<div>Hello <div>World</div></div>"
-- > scrapeStringLike s (texts "div") == Just ["Hello World", "World"]
texts :: (TagSoup.StringLike str)
texts :: (StringLike str)
=> Selector -> Scraper str [str]
texts s = MkScraper $ withAll tagsToText . select s

Expand All @@ -123,15 +124,15 @@ texts s = MkScraper $ withAll tagsToText . select s
--
-- This function will match only the first set of tags matching the selector, to
-- match every set of tags, use 'htmls'.
html :: (TagSoup.StringLike str) => Selector -> Scraper str str
html :: (StringLike str) => Selector -> Scraper str str
html s = MkScraper $ withHead tagsToHTML . select s

-- | The 'htmls' function takes a selector and returns the html string from
-- every set of tags (possibly nested) matching the given selector.
--
-- > s = "<div><div>A</div></div>"
-- > scrapeStringLike s (htmls "div") == Just ["<div><div>A</div></div>", "<div>A</div>"]
htmls :: (TagSoup.StringLike str)
htmls :: (StringLike str)
=> Selector -> Scraper str [str]
htmls s = MkScraper $ withAll tagsToHTML . select s

Expand All @@ -141,7 +142,7 @@ htmls s = MkScraper $ withAll tagsToHTML . select s
--
-- This function will match only the first set of tags matching the selector, to
-- match every set of tags, use 'innerHTMLs'.
innerHTML :: (TagSoup.StringLike str)
innerHTML :: (StringLike str)
=> Selector -> Scraper str str
innerHTML s = MkScraper $ withHead tagsToInnerHTML . select s

Expand All @@ -150,7 +151,7 @@ innerHTML s = MkScraper $ withHead tagsToInnerHTML . select s
--
-- > s = "<div><div>A</div></div>"
-- > scrapeStringLike s (innerHTMLs "div") == Just ["<div>A</div>", "A"]
innerHTMLs :: (TagSoup.StringLike str)
innerHTMLs :: (StringLike str)
=> Selector -> Scraper str [str]
innerHTMLs s = MkScraper $ withAll tagsToInnerHTML . select s

Expand All @@ -160,22 +161,22 @@ innerHTMLs s = MkScraper $ withAll tagsToInnerHTML . select s
--
-- This function will match only the opening tag matching the selector, to match
-- every tag, use 'attrs'.
attr :: (Show str, TagSoup.StringLike str)
=> String -> Selector -> Scraper str str
attr :: (Show str, StringLike str)
=> str -> Selector -> Scraper str str
attr name s = MkScraper
$ join . withHead (tagsToAttr $ TagSoup.castString name) . select s
$ join . withHead (tagsToAttr $ castString name) . select s

-- | The 'attrs' function takes an attribute name and a selector and returns the
-- value of the attribute of the given name for every opening tag
-- (possibly nested) that matches the given selector.
--
-- > s = "<div id=\"out\"><div id=\"in\"></div></div>"
-- > scrapeStringLike s (attrs "id" "div") == Just ["out", "in"]
attrs :: (Show str, TagSoup.StringLike str)
=> String -> Selector -> Scraper str [str]
attrs :: (Show str, StringLike str)
=> str -> Selector -> Scraper str [str]
attrs name s = MkScraper
$ fmap catMaybes . withAll (tagsToAttr nameStr) . select s
where nameStr = TagSoup.castString name
where nameStr = castString name

-- | The 'position' function is intended to be used within the do-block of a
-- `chroots` call. Within the do-block position will return the index of the
Expand Down Expand Up @@ -211,7 +212,7 @@ attrs name s = MkScraper
-- , (2, "Third paragraph.")
-- ]
-- @
position :: (TagSoup.StringLike str) => Scraper str Int
position :: (StringLike str) => Scraper str Int
position = MkScraper $ Just . tagsToPosition

withHead :: (a -> b) -> [a] -> Maybe b
Expand All @@ -221,27 +222,27 @@ withHead f (x:_) = Just $ f x
withAll :: (a -> b) -> [a] -> Maybe [b]
withAll f xs = Just $ map f xs

foldSpec :: TagSoup.StringLike str
foldSpec :: StringLike str
=> (TagSoup.Tag str -> str -> str) -> TagSpec str -> str
foldSpec f = Vector.foldr' (f . infoTag) TagSoup.empty . (\(a, _, _) -> a)


tagsToText :: TagSoup.StringLike str => TagSpec str -> str
tagsToText :: StringLike str => TagSpec str -> str
tagsToText = foldSpec f
where
f (TagSoup.TagText str) s = str `TagSoup.append` s
f _ s = s

tagsToHTML :: TagSoup.StringLike str => TagSpec str -> str
tagsToHTML :: StringLike str => TagSpec str -> str
tagsToHTML = foldSpec (\tag s -> TagSoup.renderTags [tag] `TagSoup.append` s)

tagsToInnerHTML :: TagSoup.StringLike str => TagSpec str -> str
tagsToInnerHTML :: StringLike str => TagSpec str -> str
tagsToInnerHTML (tags, tree, ctx)
| len < 2 = TagSoup.empty
| otherwise = tagsToHTML (Vector.slice 1 (len - 2) tags, tree, ctx)
where len = Vector.length tags

tagsToAttr :: (Show str, TagSoup.StringLike str)
tagsToAttr :: (Show str, StringLike str)
=> str -> TagSpec str -> Maybe str
tagsToAttr attr (tags, _, _) = do
guard $ 0 < Vector.length tags
Expand Down
26 changes: 13 additions & 13 deletions scalpel-core/src/Text/HTML/Scalpel/Internal/Select.hs
Original file line number Diff line number Diff line change
Expand Up @@ -15,13 +15,13 @@ import Text.HTML.Scalpel.Internal.Select.Types

import Control.Applicative ((<$>), (<|>))
import Data.Maybe (catMaybes, isJust, fromJust, fromMaybe)
import Text.StringLike (StringLike, castString)

import qualified Data.Map.Strict as Map
import qualified Data.Text as T
import qualified Data.Tree as Tree
import qualified Data.Vector as Vector
import qualified Text.HTML.TagSoup as TagSoup
import qualified Text.StringLike as TagSoup


type Index = Int
Expand Down Expand Up @@ -73,7 +73,7 @@ type TagSpec str = (TagVector str, TagForest, SelectContext)
-- | The 'select' function takes a 'Selectable' value and a list of
-- 'TagSoup.Tag's and returns a list of every subsequence of the given list of
-- Tags that matches the given selector.
select :: (TagSoup.StringLike str)
select :: (StringLike str)
=> Selector -> TagSpec str -> [TagSpec str]
select s tagSpec = newSpecs
where
Expand All @@ -83,7 +83,7 @@ select s tagSpec = newSpecs
applyPosition p (tags, f, _) = (tags, f, SelectContext p True)

-- | Creates a TagSpec from a list of tags parsed by TagSoup.
tagsToSpec :: forall str. (TagSoup.StringLike str)
tagsToSpec :: forall str. (StringLike str)
=> [TagSoup.Tag str] -> TagSpec str
tagsToSpec tags = (vector, tree, ctx)
where
Expand Down Expand Up @@ -113,7 +113,7 @@ tagsToSpec tags = (vector, tree, ctx)
-- closing offset.
--
-- (5) The result set is then sorted by their indices.
tagsToVector :: forall str. (TagSoup.StringLike str)
tagsToVector :: forall str. (StringLike str)
=> [TagSoup.Tag str] -> TagVector str
tagsToVector tags = let indexed = zip tags [0..]
total = length indexed
Expand Down Expand Up @@ -164,17 +164,17 @@ tagsToVector tags = let indexed = zip tags [0..]
popTag (Just (_ : y : xs)) = let s = y : xs in s `seq` Just s
popTag _ = Nothing

getTagName :: TagSoup.StringLike str => TagSoup.Tag str -> T.Text
getTagName (TagSoup.TagOpen name _) = TagSoup.castString name
getTagName (TagSoup.TagClose name) = TagSoup.castString name
getTagName :: StringLike str => TagSoup.Tag str -> T.Text
getTagName (TagSoup.TagOpen name _) = castString name
getTagName (TagSoup.TagClose name) = castString name
getTagName _ = undefined

-- | Builds a forest describing the structure of the tags within a given vector.
-- The nodes of the forest are tag spans which mark the indices within the
-- vector of an open and close pair. The tree is organized such for any node n
-- the parent of node n is the smallest span that completely encapsulates the
-- span of node n.
vectorToTree :: TagSoup.StringLike str => TagVector str -> TagForest
vectorToTree :: StringLike str => TagVector str -> TagForest
vectorToTree tags = fixup $ forestWithin 0 (Vector.length tags)
where
forestWithin :: Int -> Int -> TagForest
Expand Down Expand Up @@ -217,7 +217,7 @@ vectorToTree tags = fixup $ forestWithin 0 (Vector.length tags)
-- tree the SelectNode is popped and the current node's sub-forest is traversed
-- with the remaining SelectNodes. If there is only a single SelectNode then any
-- node encountered that satisfies the SelectNode is returned as an answer.
selectNodes :: TagSoup.StringLike str
selectNodes :: StringLike str
=> [(SelectNode, SelectSettings)]
-> TagSpec str
-> TagSpec str
Expand Down Expand Up @@ -305,7 +305,7 @@ boolMatch True = MatchOk
boolMatch False = MatchFail

-- | Returns True if a tag satisfies a given SelectNode's condition.
nodeMatches :: TagSoup.StringLike str
nodeMatches :: StringLike str
=> (SelectNode, SelectSettings)
-> TagInfo str
-> TagSpec str
Expand All @@ -322,7 +322,7 @@ nodeMatches (SelectText , settings) info cur root =
-- | Given a SelectSettings, the current node under consideration, and the last
-- matched node, returns true IFF the current node satisfies all of the
-- selection settings.
checkSettings :: TagSoup.StringLike str
checkSettings :: StringLike str
=> SelectSettings -> TagSpec str -> TagSpec str -> MatchResult
checkSettings (SelectSettings (Just depth))
(_, curRoot : _, _)
Expand All @@ -341,7 +341,7 @@ checkSettings (SelectSettings _) _ _ = MatchOk

-- | Given a tag name and a list of attribute predicates return a function that
-- returns true if a given tag matches the supplied name and predicates.
checkTag :: TagSoup.StringLike str
checkTag :: StringLike str
=> T.Text -> [AttributePredicate] -> TagInfo str -> MatchResult
checkTag name preds (TagInfo tag tagName _)
= boolMatch (
Expand All @@ -351,7 +351,7 @@ checkTag name preds (TagInfo tag tagName _)
) `andMatch` checkPreds preds tag

-- | Returns True if a tag satisfies a list of attribute predicates.
checkPreds :: TagSoup.StringLike str
checkPreds :: StringLike str
=> [AttributePredicate] -> TagSoup.Tag str -> MatchResult
checkPreds [] tag = boolMatch
$ TagSoup.isTagOpen tag || TagSoup.isTagText tag
Expand Down
Loading