add default cache_knowledge to google bucket urls #575

Open
wants to merge 1 commit into
base: master
from

milovate
Contributor

@milovate milovate commented Dec 26, 2024

Q/A checklist

  • If you add new dependencies, did you update the lock file?
poetry lock --no-update
  • Run tests
ulimit -n unlimited && ./scripts/run-tests.sh
  • Do a self code review of the changes - Read the diff at least twice.
  • Carefully think about the stuff that might break because of this change - this sounds obvious but it's easy to forget to do "Go to references" on each function you're changing and see if it's used in a way you didn't expect.
  • The relevant pages still run when you press submit
  • The API for those pages still works (API tab)
  • The public API interface doesn't change if you didn't want it to (check API tab > docs page)
  • Do your UI changes (if applicable) look acceptable on mobile?
  • Ensure you have not regressed the import time unless you have a good reason to do so.
    You can visualize this using tuna:
python3 -X importtime -c 'import server' 2> out.log && tuna out.log

To measure import time for a specific library:

$ time python -c 'import pandas'

________________________________________________________
Executed in    1.15 secs    fish           external
   usr time    2.22 secs   86.00 micros    2.22 secs
   sys time    0.72 secs  613.00 micros    0.72 secs
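
Where the shell's `time` builtin is unavailable or formats its output differently, the same measurement can be approximated from Python. This is a minimal sketch, not part of the project's tooling: `cold_import_seconds` is a hypothetical helper name, and like the `time` command above its result includes interpreter startup.

```python
import subprocess
import sys
import time


def cold_import_seconds(module: str) -> float:
    """Time `import <module>` in a fresh interpreter.

    A new process is used so that modules already loaded in the
    current interpreter do not make the import look faster than it is.
    """
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Pass any installed module name; "json" is used here because it ships with Python.
    print(f"json: {cold_import_seconds('json'):.3f}s")
```

Comparing the result against a bare `python -c 'pass'` run gives a rough idea of how much of the time is the import itself.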

To reduce import times, import libraries that take a long time inside the functions that use them instead of at the top of the file:

def my_function():
    import pandas as pd
    ...

Legal Boilerplate

Look, I get it. The entity doing business as “Gooey.AI” and/or “Dara.network” was incorporated in the State of Delaware in 2020 as Dara Network Inc. and is gonna need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Dara Network Inc can use, modify, copy, and redistribute my contributions, under its choice of terms.

@milovate milovate marked this pull request as ready for review December 26, 2024 14:27

@greptile-apps greptile-apps bot left a comment

PR Summary

Added caching support for Google bucket URLs and improved document embedding handling in vector search functionality.

  • Modified get_or_create_embedded_file in vector_search.py to handle user-uploaded URLs differently and validate file metadata before reusing cached embeddings
  • Added support for multiple leaf URLs from a single input URL in document embedding creation process
  • Removed flatmap_parallel function in favor of more direct metadata handling
  • Updated Vespa database integration for more efficient document storage and retrieval
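
The first bullet above (treating user-uploaded URLs differently and validating file metadata before reusing cached embeddings) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `FileMeta`, `EmbeddedFile`, `is_user_uploaded_url`, the `storage.googleapis.com` host check, the in-memory `cache` dict, and the `fetch_meta` / `embed` callables are all hypothetical stand-ins for whatever `vector_search.py` actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

# Assumption: user uploads are served from this host; the real check may differ.
USER_UPLOAD_HOSTS = {"storage.googleapis.com"}


@dataclass(frozen=True)
class FileMeta:
    etag: Optional[str]
    size: Optional[int]


@dataclass
class EmbeddedFile:
    url: str
    meta: Optional[FileMeta]


def is_user_uploaded_url(url: str) -> bool:
    """Return True when the URL points at the (assumed) upload bucket host."""
    return urlparse(url).hostname in USER_UPLOAD_HOSTS


def get_or_create_embedded_file_sketch(
    url: str,
    cache: Dict[str, EmbeddedFile],
    fetch_meta: Callable[[str], FileMeta],
    embed: Callable[[str, Optional[FileMeta]], EmbeddedFile],
) -> EmbeddedFile:
    cached = cache.get(url)
    if is_user_uploaded_url(url):
        # Assumption: uploaded files never change under the same URL,
        # so a cached entry is reused without fetching metadata.
        if cached is not None:
            return cached
        file_meta = None
    else:
        # For other URLs, reuse the cache only if the metadata still matches.
        file_meta = fetch_meta(url)
        if cached is not None and cached.meta == file_meta:
            return cached
    created = embed(url, file_meta)
    cache[url] = created
    return created
```

In this sketch a bucket URL never triggers a metadata request, while any other URL is re-embedded as soon as its metadata differs from what was stored alongside the cached embedding.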


1 file(s) reviewed, no comment(s)

max_context_words=max_context_words,
scroll_jump=scroll_jump,
google_translate_target=google_translate_target or "",
selected_asr_model=selected_asr_model or "",
embedding_model=embedding_model.name,
)
file_meta = None
Contributor Author

Should this be handled better?

@milovate milovate requested a review from devxpy December 26, 2024 14:30
@milovate milovate self-assigned this Dec 26, 2024
@milovate milovate changed the title feat: add default cache_knowledge to google bucket urls, updated vesp… add default cache_knowledge to google bucket urls Dec 26, 2024