
Commit

removed installations in examples
splendidbug committed Aug 24, 2024
1 parent 6c15260 commit 77eb3a8
Showing 11 changed files with 44 additions and 65 deletions.
21 changes: 21 additions & 0 deletions README.md
@@ -100,3 +100,24 @@ using AIHelpMe: last_result
# last_result() returns the last result from the RAG pipeline, ie, same as running aihelp(; return_all=true)
print(last_result())
```
## Output
`make_knowledge_packs` creates the following files:
```
index_name\
├── Index\
│ ├── index_name__artifact__info.txt
│ ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
│ └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz
├── Scraped_files\
│ ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
│ ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
│ └── . . .
└── index_name_URL_mapping.csv
```
- The `Index` directory contains the `.hdf5` and `.tar.gz` files along with `artifact__info.txt`, which holds the sha256 and git-tree-sha1 hashes.
- The `Scraped_files` directory contains the scraped chunks and sources as `.jls` files, separated by the hostnames of the scraped URLs (see the inspection sketch below).
- `URL_mapping.csv` maps each scraped URL to the estimated package name.
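For a quick look inside these artifacts, here is a minimal sketch (the `genie` pack name and file paths are illustrative; it assumes the `.jls` files are plain Julia-serialized objects):
```julia
using Serialization

pack_dir    = "genie"                               # illustrative index_name
scraped_dir = joinpath(pack_dir, "Scraped_files")

# Each .jls file holds Julia-serialized chunk or source data for one hostname
for file in filter(f -> endswith(f, ".jls"), readdir(scraped_dir; join = true))
    println(basename(file), ": ", summary(deserialize(file)))
end

# The URL mapping is a plain CSV; readlines is enough for a quick look
foreach(println, readlines(joinpath(pack_dir, "genie_URL_mapping.csv")))
```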
21 changes: 21 additions & 0 deletions docs/src/index.md
@@ -97,3 +97,24 @@ Tip: Use `pprint` for nicer outputs with sources and `last_result` for more deta
using AIHelpMe: last_result
print(last_result())
```
## Output
`make_knowledge_packs` creates the following files:
```
index_name\
├── Index\
│ ├── index_name__artifact__info.txt
│ ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
│ └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz
├── Scraped_files\
│ ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
│ ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
│ └── . . .
└── index_name_URL_mapping.csv
```
- The `Index` directory contains the `.hdf5` and `.tar.gz` files along with `artifact__info.txt`, which holds the sha256 and git-tree-sha1 hashes.
- The `Scraped_files` directory contains the scraped chunks and sources as `.jls` files, separated by the hostnames of the scraped URLs.
- `URL_mapping.csv` maps each scraped URL to the estimated package name.
@@ -1,18 +1,9 @@
# The example below demonstrates the creation of Genie knowledge pack

using Pkg
Pkg.activate(temp = true)
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
using DocsScraper

# The crawler will run on these URLs to look for more URLs with the same hostname
crawlable_urls = ["https://learn.genieframework.com/"]

index_path = make_knowledge_packs(crawlable_urls;
target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "genie", custom_metadata = "Genie ecosystem")

# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.

# The above example creates an output directory index_name which contains the sub-directories "Scraped" and "Index".
# "Scraped" contains .jls files of chunks and sources of the scraped URLs. Index contains the created index along with a .txt file
# containing the artifact info. The output directory also contains the URL mapping csv.
@@ -1,8 +1,5 @@
# The example below demonstrates the creation of JuliaData knowledge pack

using Pkg
Pkg.activate(temp = true)
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
using DocsScraper

# The crawler will run on these URLs to look for more URLs with the same hostname
@@ -25,9 +22,3 @@ single_page_urls = ["https://docs.julialang.org/en/v1/manual/missing/",

index_path = make_knowledge_packs(crawlable_urls; single_urls = single_page_urls,
target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "juliadata", custom_metadata = "JuliaData ecosystem")

# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.

# The above example creates an output directory index_name which contains the sub-directories "Scraped" and "Index".
# "Scraped" contains .jls files of chunks and sources of the scraped URLs. Index contains the created index along with a .txt file
# containing the artifact info. The output directory also contains the URL mapping csv.
@@ -1,8 +1,5 @@
# The example below demonstrates the creation of JuliaLang knowledge pack

using Pkg
Pkg.activate(temp = true)
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
using DocsScraper

# The crawler will run on these URLs to look for more URLs with the same hostname
@@ -16,9 +13,3 @@ crawlable_urls = [
index_path = make_knowledge_packs(crawlable_urls;
target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"),
index_name = "julialang", custom_metadata = "JuliaLang ecosystem")

# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.

# The above example creates an output directory index_name which contains the sub-directories "Scraped" and "Index".
# "Scraped" contains .jls files of chunks and sources of the scraped URLs. Index contains the created index along with a .txt file
# containing the artifact info. The output directory also contains the URL mapping csv.
@@ -1,8 +1,5 @@
# The example below demonstrates the creation of Makie knowledge pack

using Pkg
Pkg.activate(temp = true)
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
using DocsScraper

# The crawler will run on these URLs to look for more URLs with the same hostname
@@ -18,9 +15,3 @@ crawlable_urls = ["https://docs.juliahub.com/MakieGallery/Ql23q/0.2.17/",

index_path = make_knowledge_packs(crawlable_urls;
target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "makie", custom_metadata = "Makie ecosystem")

# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.

# The above example creates an output directory index_name which contains the sub-directories "Scraped" and "Index".
# "Scraped" contains .jls files of chunks and sources of the scraped URLs. Index contains the created index along with a .txt file
# containing the artifact info. The output directory also contains the URL mapping csv.
@@ -1,8 +1,5 @@
# The example below demonstrates the creation of plots knowledge pack

using Pkg
Pkg.activate(temp = true)
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
using DocsScraper

# The crawler will run on these URLs to look for more URLs with the same hostname
@@ -21,9 +18,3 @@ crawlable_urls = [

index_path = make_knowledge_packs(crawlable_urls;
target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "plots", custom_metadata = "Plots ecosystem")

# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.

# The above example creates an output directory index_name which contains the sub-directories "Scraped" and "Index".
# "Scraped" contains .jls files of chunks and sources of the scraped URLs. Index contains the created index along with a .txt file
# containing the artifact info. The output directory also contains the URL mapping csv.
@@ -1,8 +1,5 @@
# The example below demonstrates the creation of SciML knowledge pack

using Pkg
Pkg.activate(temp = true)
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
using DocsScraper

# The crawler will run on these URLs to look for more URLs with the same hostname
@@ -43,9 +40,3 @@ single_page_urls = [

index_path = make_knowledge_packs(crawlable_urls; single_urls = single_page_urls,
target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "sciml", custom_metadata = "SciML ecosystem")

# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.

# The above example creates an output directory index_name which contains the sub-directories "Scraped" and "Index".
# "Scraped" contains .jls files of chunks and sources of the scraped URLs. Index contains the created index along with a .txt file
# containing the artifact info. The output directory also contains the URL mapping csv.
@@ -1,8 +1,5 @@
# The example below demonstrates the creation of Tidier knowledge pack

using Pkg
Pkg.activate(temp = true)
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
using DocsScraper

# The crawler will run on these URLs to look for more URLs with the same hostname
@@ -13,9 +10,3 @@ crawlable_urls = ["https://tidierorg.github.io/Tidier.jl/dev/",

index_path = make_knowledge_packs(crawlable_urls;
target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "tidier", custom_metadata = "Tidier ecosystem")

# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.

# The above example creates an output directory index_name which contains the sub-directories "Scraped" and "Index".
# "Scraped" contains .jls files of chunks and sources of the scraped URLs. Index contains the created index along with a .txt file
# containing the artifact info. The output directory also contains the URL mapping csv.
2 changes: 1 addition & 1 deletion src/extract_package_name.jl
@@ -4,7 +4,7 @@
Strip URL of any http:// or https:// or www. prefixes
"""
function clean_url(url::String)
# Remove http://, https://, www., or wwws.
# Remove http://, https://, www.
cleaned_url = replace(url, r"^https?://(www\d?\.)?" => "")
return cleaned_url
end
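For context, an illustrative check of what the regex above strips (these calls and results are assumptions about the behaviour, not part of the package's tests):
```julia
clean_url("https://www.example.com/docs")   # -> "example.com/docs"
clean_url("http://www2.example.org/")       # -> "example.org/"
clean_url("docs.julialang.org/en/v1/")      # -> unchanged; no scheme prefix to strip
```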
2 changes: 1 addition & 1 deletion src/make_knowledge_packs.jl
@@ -70,7 +70,7 @@ end
single_urls::Vector{<:AbstractString} = String[], target_path::AbstractString = "", index_name::AbstractString = "")
Validate args. Return error if both `crawlable_urls` and `single_urls` are empty.
Create a target path if input path is invalid. Create a gensym index if the input index is inavlid.
Create a target path if input path is invalid. Create a gensym index if the input index is invalid.
# Arguments
- crawlable_urls: URLs that should be crawled to find more links
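A rough sketch of the validation behaviour this docstring describes (a hypothetical stand-alone helper, not the package's actual implementation):
```julia
function validate_args(; crawlable_urls = String[], single_urls = String[],
        target_path::AbstractString = "", index_name::AbstractString = "")
    # Error if there is nothing to scrape at all
    isempty(crawlable_urls) && isempty(single_urls) &&
        error("At least one of `crawlable_urls` or `single_urls` must be non-empty")

    # Fall back to a default target path when the given one is missing or unusable
    if isempty(target_path) || !ispath(dirname(abspath(target_path)))
        target_path = joinpath(pwd(), "knowledge_packs")
    end

    # Fall back to a generated (gensym) index name when none is given
    isempty(index_name) && (index_name = string(gensym("index")))

    return target_path, index_name
end
```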
