Reference
DocsScraper.base_url_segment
DocsScraper.check_robots_txt
DocsScraper.clean_url
DocsScraper.crawl
DocsScraper.create_output_folders
DocsScraper.docs_in_url
DocsScraper.find_duplicates
DocsScraper.find_urls_html!
DocsScraper.find_urls_xml!
DocsScraper.generate_embeddings
DocsScraper.get_base_url
DocsScraper.get_header_path
DocsScraper.get_html_content
DocsScraper.get_package_name
DocsScraper.get_urls!
DocsScraper.insert_parsed_data!
DocsScraper.l2_norm_columns
DocsScraper.l2_norm_columns
DocsScraper.make_chunks
DocsScraper.make_knowledge_packs
DocsScraper.nav_bar
DocsScraper.parse_robots_txt!
DocsScraper.parse_url_to_blocks
DocsScraper.postprocess_chunks
DocsScraper.process_code
DocsScraper.process_docstring!
DocsScraper.process_generic_node!
DocsScraper.process_headings!
DocsScraper.process_hostname
DocsScraper.process_hostname!
DocsScraper.process_node!
DocsScraper.process_node!
DocsScraper.process_paths
DocsScraper.remove_duplicates
DocsScraper.remove_short_chunks
DocsScraper.remove_urls_from_index
DocsScraper.report_artifact
DocsScraper.resolve_url
DocsScraper.roll_up_chunks
DocsScraper.text_before_version
DocsScraper.url_package_name
DocsScraper.urls_for_metadata
PromptingTools.Experimental.RAGTools.get_chunks
DocsScraper.base_url_segment — Method
base_url_segment(url::String)
Return the base URL and the first path segment if all the other checks fail.
DocsScraper.check_robots_txt — Method
check_robots_txt(user_agent::AbstractString, url::AbstractString)
Check the robots.txt of a URL and return a boolean indicating whether user_agent is allowed to crawl the input URL, along with any sitemap URLs.
Arguments
- user_agent: user agent attempting to crawl the webpage
- url: input URL string
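For illustration only, a minimal sketch of a call, assuming the documented return shape of a Bool plus sitemap URLs (the user-agent string and URL are placeholders):
using DocsScraper
# Ask robots.txt whether this user agent may crawl the page; also collect any sitemap URLs
allowed, sitemap_urls = DocsScraper.check_robots_txt("example-scraper-bot", "https://docs.julialang.org/en/v1/")
allowed || @warn "Crawling this URL is not permitted by robots.txt"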
DocsScraper.clean_url — Method
clean_url(url::String)
Strip the URL of any http://, https://, or www. prefixes.
DocsScraper.crawl — Method
crawl(input_urls::Vector{<:AbstractString})
Crawl the input URLs and return hostname_url_dict, a dictionary whose keys are hostnames and whose values are the URLs found under each hostname.
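A minimal sketch of a call, assuming crawl returns only the documented hostname_url_dict (the starting URL is illustrative):
using DocsScraper
# Dict mapping each hostname to the vector of URLs discovered under it
hostname_url_dict = DocsScraper.crawl(["https://juliagenai.github.io/DocsScraper.jl/dev/"])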
DocsScraper.create_output_folders — Method
create_output_folders(knowledge_pack_path::String)
Create the output folders under knowledge_pack_path.
DocsScraper.docs_in_url — Method
docs_in_url(url::AbstractString)
If the base URL is of the form docs.package_name.domain_extension, return the middle segment, i.e., package_name.
DocsScraper.find_duplicates — Method
find_duplicates(chunks::AbstractVector{<:AbstractString})
Find duplicates in a list of chunks using a SHA-256 hash. Returns a bit vector of the same length as the input list, where true indicates a duplicate (second instance of the same text).
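A toy sketch of the documented behavior; the commented result restates the description above rather than verified output:
chunks = ["alpha", "beta", "alpha"]
dupes = DocsScraper.find_duplicates(chunks)
# expected: Bool[0, 0, 1] — only the second "alpha" is marked as a duplicate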
DocsScraper.find_urls_html! — Method
find_urls_html!(url::AbstractString, node::Gumbo.HTMLElement, url_queue::Vector{<:AbstractString})
Recursively find <a> tags and extract the URLs.
Arguments
- url: The initial input URL
- node: The HTML node of type Gumbo.HTMLElement
- url_queue: Vector to which extracted URLs are appended
DocsScraper.find_urls_xml! — Method
find_urls_xml!(url::AbstractString, url_queue::Vector{<:AbstractString})
Identify URLs in XML files via a regex pattern and push them into url_queue.
Arguments
- url: URL from which all other URLs will be extracted
- url_queue: Vector to which extracted URLs are appended
DocsScraper.generate_embeddings — Method
generate_embeddings(knowledge_pack_path::String; model::AbstractString=MODEL,
    embedding_size::Int=EMBEDDING_SIZE, custom_metadata::AbstractString,
    bool_embeddings::Bool = true, index_name::AbstractString = "")
Deserialize chunks and sources to generate embeddings. Note: We highly recommend passing index_name; it becomes the name of the generated index. Default: date-randomInt.
Arguments
- model: Embedding model
- embedding_size: Embedding dimensions
- custom_metadata: Custom metadata, such as an ecosystem name, if required
- bool_embeddings: If true, the generated embeddings are boolean; Float32 otherwise
- index_name: Name of the index. Default: date-randomInt
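A usage sketch with placeholder values (the path, metadata string, and index name are illustrative, not defaults of the package):
using DocsScraper
# Embed previously scraped chunks stored under this knowledge-pack folder
DocsScraper.generate_embeddings("knowledge_packs/";
    custom_metadata = "example ecosystem",
    bool_embeddings = true,
    index_name = "example_index")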
DocsScraper.get_base_url — Method
get_base_url(url::AbstractString)
Extract the base URL.
DocsScraper.get_header_path — Method
get_header_path(d::Dict)
Concatenate the h1, h2, and h3 keys from the metadata of a Dict.
Examples
d = Dict("metadata" => Dict{Symbol,Any}(:h1 => "Axis", :h2 => "Attributes", :h3 => "yzoomkey"), "heading" => "yzoomkey")
get_header_path(d)
# Output: "Axis/Attributes/yzoomkey"
DocsScraper.get_html_content — Method
get_html_content(root::Gumbo.HTMLElement)
Return the main content of the HTML. If it is not found, return the whole HTML to parse.
Arguments
- root: The HTML root from which content is extracted
DocsScraper.get_package_name — Method
get_package_name(url::AbstractString)
Return the name of the package inferred from the package URL.
DocsScraper.get_urls! — Method
get_urls!(url::AbstractString, url_queue::Vector{<:AbstractString})
Extract URLs from HTML or XML files.
Arguments
- url: URL from which all other URLs will be extracted
- url_queue: Vector to which extracted URLs are appended
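A sketch of the mutating call (the page URL is illustrative):
url_queue = Vector{AbstractString}()
# Appends any URLs found in the HTML or XML at this address to url_queue
DocsScraper.get_urls!("https://juliagenai.github.io/DocsScraper.jl/dev/", url_queue)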
DocsScraper.insert_parsed_data! — Method
insert_parsed_data!(heading_hierarchy::Dict{Symbol,Any},
    parsed_blocks::Vector{Dict{String,Any}},
    text_to_insert::AbstractString,
    text_type::AbstractString)
Insert the text into the parsed_blocks vector.
Arguments
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
- text_to_insert: Text to be inserted
- text_type: Whether the inserted text is a heading, a code block, or plain text
DocsScraper.l2_norm_columns — Method
l2_norm_columns(mat::AbstractMatrix)
Normalize the columns of the input embeddings.
DocsScraper.l2_norm_columns — Method
l2_norm_columns(vect::AbstractVector)
Normalize the input embedding vector (L2 norm).
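For intuition, a hand-rolled equivalent of column-wise L2 normalization (this illustrates the underlying operation, not the package's implementation):
using LinearAlgebra
mat = rand(Float32, 4, 3)
normalized = mat ./ map(norm, eachcol(mat))'          # each column divided by its Euclidean norm
all(c -> isapprox(norm(c), 1), eachcol(normalized))   # true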
DocsScraper.make_chunks — Method
make_chunks(hostname_url_dict::Dict{AbstractString,Vector{AbstractString}}, knowledge_pack_path::String;
    max_chunk_size::Int=MAX_CHUNK_SIZE, min_chunk_size::Int=MIN_CHUNK_SIZE)
Parse the URLs in hostname_url_dict and save the resulting chunks.
Arguments
- hostname_url_dict: Dict with hostnames as keys and vectors of URLs as values
- knowledge_pack_path: Knowledge pack path
- max_chunk_size: Maximum chunk size
- min_chunk_size: Minimum chunk size
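A sketch of the call shape (the hostname, URL, and output folder are placeholders; in practice the dictionary would come from crawl):
hostname_url_dict = Dict{AbstractString,Vector{AbstractString}}(
    "juliagenai.github.io" => ["https://juliagenai.github.io/DocsScraper.jl/dev/"])
# Parse every URL per hostname and serialize the chunks under the knowledge-pack folder
DocsScraper.make_chunks(hostname_url_dict, "knowledge_packs/")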
DocsScraper.make_knowledge_packs — Function
make_knowledge_packs(crawlable_urls::Vector{<:AbstractString}=String[]; single_urls::Vector{<:AbstractString}=String[],
    max_chunk_size::Int=MAX_CHUNK_SIZE, min_chunk_size::Int=MIN_CHUNK_SIZE, model::AbstractString=MODEL, embedding_size::Int=EMBEDDING_SIZE,
    custom_metadata::AbstractString, bool_embeddings::Bool = true, index_name::AbstractString = "")
Entry point to crawl, parse, and generate embeddings. Note: We highly recommend passing index_name; it becomes the name of the generated index. Default: date-randomInt.
Arguments
- crawlable_urls: URLs that should be crawled to find more links
- single_urls: Single-page URLs that are only scraped and parsed; the crawler will not look for more URLs there
- max_chunk_size: Maximum chunk size
- min_chunk_size: Minimum chunk size
- model: Embedding model
- embedding_size: Embedding dimensions
- custom_metadata: Custom metadata, such as an ecosystem name, if required
- bool_embeddings: If true, the generated embeddings are boolean; Float32 otherwise
- index_name: Name of the index. Default: date-randomInt
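A minimal end-to-end sketch using only the keyword arguments documented above (the URL, index name, and empty metadata string are illustrative):
using DocsScraper
# Crawl the site, chunk and embed the pages, and write a knowledge pack with the given index name
DocsScraper.make_knowledge_packs(["https://juliagenai.github.io/DocsScraper.jl/dev/"];
    index_name = "docsscraper", custom_metadata = "")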
DocsScraper.nav_bar — Method
nav_bar(url::AbstractString)
Julia documentation websites tend to have the package name under the ".docs-package-name" class in the HTML tree.
DocsScraper.parse_robots_txt! — Method
parse_robots_txt!(robots_txt::String)
Parse the robots.txt string and return the rules and the sitemap URLs.
DocsScraper.parse_url_to_blocks — Method
parse_url_to_blocks(url::AbstractString)
Initiator and main function to parse HTML from a URL. Return a Vector of Dicts containing headings/text/code along with a Dict of the respective metadata.
DocsScraper.postprocess_chunks — Method
postprocess_chunks(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString};
    min_chunk_size::Int=MIN_CHUNK_SIZE, skip_code::Bool=true, paths::Union{Nothing,AbstractVector{<:AbstractString}}=nothing,
    websites::Union{Nothing,AbstractVector{<:AbstractString}}=nothing)
Post-process the input list of chunks and their corresponding sources by removing short chunks and duplicates.
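A sketch of the call shape; the return is assumed here to be the cleaned chunks and their sources, per the description above (inputs are toy values):
chunks  = ["tiny", "A sufficiently long documentation chunk about scraping pages.", "A sufficiently long documentation chunk about scraping pages."]
sources = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]
clean_chunks, clean_sources = DocsScraper.postprocess_chunks(chunks, sources; min_chunk_size = 40)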
DocsScraper.process_code — Method
process_code(node::Gumbo.HTMLElement)
Process code snippets. If the current node is a code block, return the text inside the code block, wrapped in backticks.
Arguments
- node: The root HTML node
DocsScraper.process_docstring! — Function
process_docstring!(node::Gumbo.HTMLElement,
    heading_hierarchy::Dict{Symbol,Any},
    parsed_blocks::Vector{Dict{String,Any}},
    child_new::Bool=true,
    prev_text_buffer::IO=IOBuffer(write=true))
Process a node of class "docstring".
Arguments
- node: The root HTML node
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
- child_new: Bool specifying whether the current block (child) is part of the previous block. If it is not, a new entry is created in parsed_blocks
- prev_text_buffer: IO buffer containing the previous text
DocsScraper.process_generic_node! — Function
process_generic_node!(node::Gumbo.HTMLElement,
    heading_hierarchy::Dict{Symbol,Any},
    parsed_blocks::Vector{Dict{String,Any}},
    child_new::Bool=true,
    prev_text_buffer::IO=IOBuffer(write=true))
Process a node that is neither a heading nor code.
Arguments
- node: The root HTML node
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
- child_new: Bool specifying whether the current block (child) is part of the previous block. If it is not, a new entry is created in parsed_blocks
- prev_text_buffer: IO buffer containing the previous text
DocsScraper.process_headings! — Method
process_headings!(node::Gumbo.HTMLElement,
    heading_hierarchy::Dict{Symbol,Any},
    parsed_blocks::Vector{Dict{String,Any}})
Process headings. If the current node is a heading, insert it directly into parsed_blocks.
Arguments
- node: The root HTML node
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
DocsScraper.process_hostname! — Method
process_hostname!(url::AbstractString, hostname_dict::Dict{AbstractString,Vector{AbstractString}})
Add url under its hostname in hostname_dict.
Arguments
- url: URL string
- hostname_dict: Dict with hostnames as keys and vectors of URLs as values
DocsScraper.process_hostname — Method
process_hostname(url::AbstractString)
Return the hostname of an input URL.
DocsScraper.process_node! — Function
process_node!(node::Gumbo.HTMLElement,
    heading_hierarchy::Dict{Symbol,Any},
    parsed_blocks::Vector{Dict{String,Any}},
    child_new::Bool=true,
    prev_text_buffer::IO=IOBuffer(write=true))
Process a node.
Arguments
- node: The root HTML node
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
- child_new: Bool specifying whether the current block (child) is part of the previous block. If it is not, a new entry is created in parsed_blocks
- prev_text_buffer: IO buffer containing the previous text
DocsScraper.process_node! — Method
Multiple dispatch for process_node!() when the node is of type Gumbo.HTMLText.
DocsScraper.process_paths — Method
process_paths(url::AbstractString; max_chunk_size::Int=MAX_CHUNK_SIZE, min_chunk_size::Int=MIN_CHUNK_SIZE)
Process the folders provided in paths. In each, take all HTML files, scrape them, chunk them, and post-process them.
DocsScraper.remove_duplicates — Method
remove_duplicates(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString})
Remove chunks that are duplicated in the input list of chunks and their corresponding sources.
DocsScraper.remove_short_chunks — Method
remove_short_chunks(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString};
    min_chunk_size::Int=MIN_CHUNK_SIZE, skip_code::Bool=true)
Remove chunks that are shorter than the specified length (min_chunk_size) from the input list of chunks and their corresponding sources.
DocsScraper.remove_urls_from_index — Function
remove_urls_from_index(index_path::AbstractString, prefix_urls=Vector{<:AbstractString})
Remove chunks and sources corresponding to URLs starting with any of the prefix_urls.
DocsScraper.report_artifact — Method
report_artifact(fn_output)
Print artifact information.
DocsScraper.resolve_url — Method
resolve_url(base_url::String, extracted_url::String)
Check the extracted URL against the original URL. Return an empty String if the extracted URL belongs to a different domain. Return the complete URL if it is a directory-traversal path or belongs to the same domain as base_url.
Arguments
- base_url: URL of the page from which other URLs are being extracted
- extracted_url: URL extracted from the base_url
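A sketch of the documented behavior (URLs are illustrative; the comments restate the rules above rather than verified output):
base = "https://docs.example.org/stable/tutorial/"
DocsScraper.resolve_url(base, "https://docs.example.org/stable/api/")   # same domain -> expected to return the complete URL
DocsScraper.resolve_url(base, "https://another-site.org/page")          # different domain -> expected to return ""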
DocsScraper.roll_up_chunks — Method
roll_up_chunks(parsed_blocks::Vector{Dict{String,Any}}, url::AbstractString; separator::String="<SEP>")
Roll up chunks that share the same header, so they can be split later on <SEP> to reach the desired length.
DocsScraper.text_before_version — Method
text_before_version(url::AbstractString)
Return the text before "stable", "dev", or any version number in the URL. Documentation websites generally place the package name before the version.
DocsScraper.url_package_name — Method
url_package_name(url::AbstractString)
Return the package name if the URL itself contains it with a ".jl" or "_jl" suffix.
DocsScraper.urls_for_metadata — Method
urls_for_metadata(sources::Vector{String})
Return a Dict of package names with their associated URLs. Note: Due to their large number, URLs are stripped down to the package name; package subpaths are not included in the metadata.
PromptingTools.Experimental.RAGTools.get_chunks — Method
RT.get_chunks(chunker::DocParserChunker, url::AbstractString;
    verbose::Bool=true, separators=["\n", ". ", " ", " "], max_chunk_size::Int=MAX_CHUNK_SIZE)
Extract chunks from HTML files by parsing the HTML content, rolling up chunks by headers, and splitting them by separators to reach the desired length.
Arguments
- chunker: DocParserChunker
- url: URL of the webpage from which to extract chunks
- verbose: Bool to print the log
- separators: Chunk separators
- max_chunk_size: Maximum chunk size
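A sketch of calling this method directly; it assumes DocParserChunker is the chunker type provided by DocsScraper with a zero-argument constructor, and that the return follows RAGTools' usual chunks-and-sources convention (the URL is illustrative):
import PromptingTools.Experimental.RAGTools as RT
using DocsScraper
chunks, sources = RT.get_chunks(DocsScraper.DocParserChunker(),
    "https://juliagenai.github.io/DocsScraper.jl/dev/")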
This document was generated with Documenter.jl version 1.5.0 on Thursday 15 August 2024. Using Julia version 1.10.4.