Unable to get basic example to run #120
Comments
@cm-howard any thoughts on this? Alternatively, would appreciate anything you could do to point me in the right direction |
@theycallmeswift are there any files in the '/data' dir? |
@vlad-isayko yep!
and
|
@theycallmeswift |
@jerpelea I did not unfortunately. The docs need a serious overhaul from someone who knows the system better than me! |
@theycallmeswift @jerpelea Hello, the problem really is outdated and incomplete documentation. We will fix this in the coming days. I'll keep you posted. |
@vlad-isayko can you share a quick update here before updating the documentation? |
At the moment, this is the current way to start
You can write to me if you have any problems |
Thanks for your quick answer. Everything behaved normally until step 6; the log is attached. I am running Ubuntu 20.04 with Python 3.8. |
@jerpelea can you also share which versions of pyspark and spark you have? |
packages from .local/lib/python3.8/site-packages aiohttp-3.8.1.dist-info |
@jerpelea maybe there is a problem with a parquet file. We need to check it. |
@vlad-isayko what version are you using? Do you have any suggestions how to check it? |
@jerpelea we use the same libraries with the same versions. Can you share some of the files that were generated in the staging area? |
@vlad-isayko thanks for your quick answer |
Is there any files in Before step 6 there should be files in directories:
|
@vlad-isayko I have there is no /staging/github/events/push/2021/01/01/ Thanks |
Can you rerun step 5? I think there is some problem at this step. |
@vlad-isayko attached are the log file and some result files filter-unlicensed.zip thanks |
Ok, it's strange that repository file in staging is empty... |
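For anyone following along, a quick way to check the staging area for missing or empty files is sketched below. The function name `summarize_staging` is hypothetical and not part of the project, and "empty" is assumed here to mean a zero-byte file, which may not be what was meant above (a parquet file with no rows still has a nonzero size).

```python
from pathlib import Path


def summarize_staging(root):
    """Summarize the files under a staging directory (hypothetical helper)."""
    root = Path(root)
    if not root.is_dir():
        # The directory itself is missing, so nothing was written there.
        return {"exists": False, "file_count": 0, "empty_files": []}
    files = [p for p in root.rglob("*") if p.is_file()]
    # Assumption: "empty" means zero bytes on disk.
    empty = sorted(str(p) for p in files if p.stat().st_size == 0)
    return {"exists": True, "file_count": len(files), "empty_files": empty}


if __name__ == "__main__":
    # Path taken from the thread; adjust to your own staging location.
    print(summarize_staging("/data/staging/github/events/push/"))
```

A missing directory and a directory with only empty files are reported separately, since they point to different failing steps.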
So the error occurred at step 4, when getting information about the repositories from the GitHub API. I ran this step on my own with your source file and will then check the output. Could you check your config for a valid GitHub API token?
github:
  token: '394***************************************77' |
@vlad-isayko this is how the logs look now. I will keep you updated on the progress. |
@vlad-isayko new errors at step 6 |
Can you share these files:
|
sure! |
Ok, there is a bug in saving a pandas dataframe in parquet format. A column in which all values are None is converted to Int32 when stored. This case is quite rare, which is apparently why we did not catch the bug earlier. We plan to fix it. For now, you can resave these files with the correct column types. |
@vlad-isayko how do I resave them? |
You can run this simple script. Or can share files from
import pandas as pd
from pathlib import Path

# Resave every parquet file in the staging area with string-typed columns
for path in Path('/data/staging/github/events/push/').rglob('*.parquet'):
    pd.read_parquet(path).astype({'language': str, 'org_name': str}).to_parquet(path, index=False) |
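To see which files are affected before resaving anything, something like the following sketch could help. It assumes the problem shows up as columns in which every value is missing; the helper names `all_null_columns` and `report_all_null_columns` are hypothetical and not part of the project, and reading parquet with pandas requires a parquet engine such as pyarrow to be installed.

```python
from pathlib import Path

import pandas as pd


def all_null_columns(df):
    """Return the names of columns in which every value is missing."""
    return [name for name in df.columns if df[name].isna().all()]


def report_all_null_columns(root):
    """Map each parquet file under root to its all-null columns (hypothetical helper)."""
    report = {}
    for path in sorted(Path(root).rglob("*.parquet")):
        empty = all_null_columns(pd.read_parquet(path))
        if empty:
            report[str(path)] = empty
    return report


if __name__ == "__main__":
    # Path taken from the script above; adjust to your own staging location.
    for file_name, columns in report_all_null_columns("/data/staging/github/events/push/").items():
        print(file_name, columns)
```

Files that appear in the report are the likely candidates for the resave step.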
@vlad-isayko thanks for the fix. It fixed the issue and step 6 completed. |
Hey, folks --
I'm having trouble getting the basic example provided to run. Specifically, the failure I'm encountering is at the daily-osci-rankings stage. I have confirmed that I have a functioning local version of Hadoop installed. I'm running on an Ubuntu 20.04 LTS VPS with a fresh install. I pulled the two most visible errors out of the log below (full log expandable at bottom of issue). It's unclear to me whether they are related, though.
Any help pointing me in the right direction would be appreciated!
Full Error Log: