
Unable to get basic example to run #120

Open
theycallmeswift opened this issue Mar 22, 2022 · 30 comments
Labels: bug (Something isn't working), question (Further information is requested)

@theycallmeswift

Hey, folks --

I'm having trouble getting the basic example provided to run. Specifically, the failure I'm encountering is at the daily-osci-rankings stage. I have confirmed that I have a functioning local version of Hadoop installed. I'm running a fresh install on an Ubuntu 20.04 LTS VPS.

I pulled the two most visible errors out of the log below (the full log is at the bottom of this issue). It's unclear to me whether they are related, though.

Any help pointing me in the right direction would be appreciated!

$ python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py daily-osci-rankings -td 2020-01-02
# failure (see full log below)

# ...

[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;\n	at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)\n	at scala.Option.getOrElse(Option.scala:189)\n	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)\n	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)\n	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)\n	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)\n	at scala.Option.getOrElse(Option.scala:189)\n	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)\n	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n	at java.lang.reflect.Method.invoke(Method.java:498)\n	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n	at py4j.Gateway.invoke(Gateway.java:282)\n	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n	at py4j.commands.CallCommand.execute(CallCommand.java:79)\n	at py4j.GatewayConnection.run(GatewayConnection.java:238)\n	at java.lang.Thread.run(Thread.java:748)\n
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data

# ...

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
  File "osci-cli.py", line 93, in <module>
    cli(standalone_mode=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
    return self._execute(**self._process_params(kwargs))
  File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
    commits = osci_ranking_job.extract(to_date=to_day).cache()
  File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
    commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
  File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
    return self.spark_session.read.load(paths, **options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
    return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from

Full Error Log:
[2022-03-22 18:11:05,996] [INFO] ENV: None
[2022-03-22 18:11:05,997] [DEBUG] Check config file for env local exists
[2022-03-22 18:11:05,997] [DEBUG] Read config from /home/ubuntu/OSCI/osci/config/files/local.yml
[2022-03-22 18:11:06,000] [DEBUG] Prod yml load: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [DEBUG] Prod yml res: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Full config: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Configuration loaded for env: local
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.LocalFileSystemConfig'>
[2022-03-22 18:11:06,000] [DEBUG] {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.Config'>
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.datalake.datalake.DataLake'>
[2022-03-22 18:11:06,113] [INFO] Execute action `daily-osci-rankings`
[2022-03-22 18:11:06,113] [INFO] Action params `{'to_day': '2020-01-02'}`
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.osci_ranking.OSCIRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.commits_ranking.OSCICommitsRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.jobs.session.Session'>
[2022-03-22 18:11:06,115] [DEBUG] Loaded paths for (None 2020-01-02 00:00:00) []
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[2022-03-22 18:11:08,127] [DEBUG] Command to send: A
fb324a0d50b599ec733f3b3b1bc1d7f4d1c894100f14d4ad6f4af9db025d37ea

[2022-03-22 18:11:08,142] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,142] [DEBUG] Command to send: j
i
rj
org.apache.spark.SparkConf
e

[2022-03-22 18:11:08,143] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,143] [DEBUG] Command to send: j
i
rj
org.apache.spark.api.java.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.api.python.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.ml.python.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.mllib.api.python.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.api.python.*
e

[2022-03-22 18:11:08,145] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,145] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.hive.*
e

[2022-03-22 18:11:08,146] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,146] [DEBUG] Command to send: j
i
rj
scala.Tuple2
e

[2022-03-22 18:11:08,146] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,146] [DEBUG] Command to send: r
u
SparkConf
rj
e

[2022-03-22 18:11:08,147] [DEBUG] Answer received: !ycorg.apache.spark.SparkConf
[2022-03-22 18:11:08,148] [DEBUG] Command to send: i
org.apache.spark.SparkConf
bTrue
e

[2022-03-22 18:11:08,154] [DEBUG] Answer received: !yro0
[2022-03-22 18:11:08,154] [DEBUG] Command to send: c
o0
contains
sspark.serializer.objectStreamReset
e

[2022-03-22 18:11:08,158] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,158] [DEBUG] Command to send: c
o0
set
sspark.serializer.objectStreamReset
s100
e

[2022-03-22 18:11:08,158] [DEBUG] Answer received: !yro1
[2022-03-22 18:11:08,158] [DEBUG] Command to send: m
d
o1
e

[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,159] [DEBUG] Command to send: c
o0
contains
sspark.rdd.compress
e

[2022-03-22 18:11:08,159] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,159] [DEBUG] Command to send: c
o0
set
sspark.rdd.compress
sTrue
e

[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yro2
[2022-03-22 18:11:08,159] [DEBUG] Command to send: m
d
o2
e

[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.master
e

[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.app.name
e

[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.master
e

[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
get
sspark.master
e

[2022-03-22 18:11:08,161] [DEBUG] Answer received: !yslocal[*]
[2022-03-22 18:11:08,161] [DEBUG] Command to send: c
o0
contains
sspark.app.name
e

[2022-03-22 18:11:08,162] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,162] [DEBUG] Command to send: c
o0
get
sspark.app.name
e

[2022-03-22 18:11:08,162] [DEBUG] Answer received: !yspyspark-shell
[2022-03-22 18:11:08,162] [DEBUG] Command to send: c
o0
contains
sspark.home
e

[2022-03-22 18:11:08,163] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,163] [DEBUG] Command to send: c
o0
getAll
e

[2022-03-22 18:11:08,163] [DEBUG] Answer received: !yto3
[2022-03-22 18:11:08,163] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,164] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,164] [DEBUG] Command to send: a
g
o3
i0
e

[2022-03-22 18:11:08,164] [DEBUG] Answer received: !yro4
[2022-03-22 18:11:08,164] [DEBUG] Command to send: c
o4
_1
e

[2022-03-22 18:11:08,165] [DEBUG] Answer received: !ysspark.rdd.compress
[2022-03-22 18:11:08,165] [DEBUG] Command to send: c
o4
_2
e

[2022-03-22 18:11:08,165] [DEBUG] Answer received: !ysTrue
[2022-03-22 18:11:08,166] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,166] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,166] [DEBUG] Command to send: a
g
o3
i1
e

[2022-03-22 18:11:08,166] [DEBUG] Answer received: !yro5
[2022-03-22 18:11:08,166] [DEBUG] Command to send: c
o5
_1
e

[2022-03-22 18:11:08,166] [DEBUG] Answer received: !ysspark.serializer.objectStreamReset
[2022-03-22 18:11:08,167] [DEBUG] Command to send: c
o5
_2
e

[2022-03-22 18:11:08,167] [DEBUG] Answer received: !ys100
[2022-03-22 18:11:08,167] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,167] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,167] [DEBUG] Command to send: a
g
o3
i2
e

[2022-03-22 18:11:08,167] [DEBUG] Answer received: !yro6
[2022-03-22 18:11:08,167] [DEBUG] Command to send: c
o6
_1
e

[2022-03-22 18:11:08,170] [DEBUG] Answer received: !ysspark.master
[2022-03-22 18:11:08,170] [DEBUG] Command to send: c
o6
_2
e

[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yslocal[*]
[2022-03-22 18:11:08,171] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,171] [DEBUG] Command to send: a
g
o3
i3
e

[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yro7
[2022-03-22 18:11:08,171] [DEBUG] Command to send: c
o7
_1
e

[2022-03-22 18:11:08,172] [DEBUG] Answer received: !ysspark.submit.pyFiles
[2022-03-22 18:11:08,172] [DEBUG] Command to send: c
o7
_2
e

[2022-03-22 18:11:08,172] [DEBUG] Answer received: !ys
[2022-03-22 18:11:08,172] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,172] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,172] [DEBUG] Command to send: a
g
o3
i4
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yro8
[2022-03-22 18:11:08,173] [DEBUG] Command to send: c
o8
_1
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !ysspark.submit.deployMode
[2022-03-22 18:11:08,173] [DEBUG] Command to send: c
o8
_2
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !ysclient
[2022-03-22 18:11:08,173] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,173] [DEBUG] Command to send: a
g
o3
i5
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yro9
[2022-03-22 18:11:08,174] [DEBUG] Command to send: c
o9
_1
e

[2022-03-22 18:11:08,174] [DEBUG] Answer received: !ysspark.ui.showConsoleProgress
[2022-03-22 18:11:08,174] [DEBUG] Command to send: c
o9
_2
e

[2022-03-22 18:11:08,174] [DEBUG] Answer received: !ystrue
[2022-03-22 18:11:08,174] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,174] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,174] [DEBUG] Command to send: a
g
o3
i6
e

[2022-03-22 18:11:08,174] [DEBUG] Answer received: !yro10
[2022-03-22 18:11:08,175] [DEBUG] Command to send: c
o10
_1
e

[2022-03-22 18:11:08,175] [DEBUG] Answer received: !ysspark.app.name
[2022-03-22 18:11:08,175] [DEBUG] Command to send: c
o10
_2
e

[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yspyspark-shell
[2022-03-22 18:11:08,175] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,175] [DEBUG] Command to send: m
d
o3
e

[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,175] [DEBUG] Command to send: r
u
JavaSparkContext
rj
e

[2022-03-22 18:11:08,186] [DEBUG] Answer received: !ycorg.apache.spark.api.java.JavaSparkContext
[2022-03-22 18:11:08,186] [DEBUG] Command to send: i
org.apache.spark.api.java.JavaSparkContext
ro0
e

[2022-03-22 18:11:09,483] [DEBUG] Answer received: !yro11
[2022-03-22 18:11:09,483] [DEBUG] Command to send: c
o11
sc
e

[2022-03-22 18:11:09,489] [DEBUG] Answer received: !yro12
[2022-03-22 18:11:09,490] [DEBUG] Command to send: c
o12
conf
e

[2022-03-22 18:11:09,499] [DEBUG] Answer received: !yro13
[2022-03-22 18:11:09,500] [DEBUG] Command to send: r
u
PythonAccumulatorV2
rj
e

[2022-03-22 18:11:09,501] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonAccumulatorV2
[2022-03-22 18:11:09,502] [DEBUG] Command to send: i
org.apache.spark.api.python.PythonAccumulatorV2
s127.0.0.1
i45879
sfb324a0d50b599ec733f3b3b1bc1d7f4d1c894100f14d4ad6f4af9db025d37ea
e

[2022-03-22 18:11:09,502] [DEBUG] Answer received: !yro14
[2022-03-22 18:11:09,502] [DEBUG] Command to send: c
o11
sc
e

[2022-03-22 18:11:09,502] [DEBUG] Answer received: !yro15
[2022-03-22 18:11:09,503] [DEBUG] Command to send: c
o15
register
ro14
e

[2022-03-22 18:11:09,505] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,505] [DEBUG] Command to send: r
u
PythonUtils
rj
e

[2022-03-22 18:11:09,506] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonUtils
[2022-03-22 18:11:09,506] [DEBUG] Command to send: r
m
org.apache.spark.api.python.PythonUtils
isEncryptionEnabled
e

[2022-03-22 18:11:09,506] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,506] [DEBUG] Command to send: c
z:org.apache.spark.api.python.PythonUtils
isEncryptionEnabled
ro11
e

[2022-03-22 18:11:09,507] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:09,508] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:09,509] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,510] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:09,510] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,510] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:09,510] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,511] [DEBUG] Command to send: r
u
org.apache.spark.SparkFiles
rj
e

[2022-03-22 18:11:09,511] [DEBUG] Answer received: !ycorg.apache.spark.SparkFiles
[2022-03-22 18:11:09,511] [DEBUG] Command to send: r
m
org.apache.spark.SparkFiles
getRootDirectory
e

[2022-03-22 18:11:09,511] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,511] [DEBUG] Command to send: c
z:org.apache.spark.SparkFiles
getRootDirectory
e

[2022-03-22 18:11:09,512] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda/userFiles-58b63090-eb7f-4872-8939-2710678287d1
[2022-03-22 18:11:09,512] [DEBUG] Command to send: c
o13
get
sspark.submit.pyFiles
s
e

[2022-03-22 18:11:09,512] [DEBUG] Answer received: !ys
[2022-03-22 18:11:09,513] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:09,514] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,514] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,515] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,515] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e

[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,516] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e

[2022-03-22 18:11:09,517] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:09,517] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
getLocalDir
e

[2022-03-22 18:11:09,519] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,519] [DEBUG] Command to send: c
o11
sc
e

[2022-03-22 18:11:09,519] [DEBUG] Answer received: !yro16
[2022-03-22 18:11:09,519] [DEBUG] Command to send: c
o16
conf
e

[2022-03-22 18:11:09,520] [DEBUG] Answer received: !yro17
[2022-03-22 18:11:09,520] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
getLocalDir
ro17
e

[2022-03-22 18:11:09,520] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda
[2022-03-22 18:11:09,520] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:09,521] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,521] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:09,522] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,522] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:09,522] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,522] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e

[2022-03-22 18:11:09,523] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,523] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e

[2022-03-22 18:11:09,523] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:09,523] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
createTempDir
e

[2022-03-22 18:11:09,523] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,524] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
createTempDir
s/tmp/spark-133764be-4844-4a91-a340-210c1b419fda
spyspark
e

[2022-03-22 18:11:09,524] [DEBUG] Answer received: !yro18
[2022-03-22 18:11:09,524] [DEBUG] Command to send: c
o18
getAbsolutePath
e

[2022-03-22 18:11:09,525] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda/pyspark-bc66966b-69a0-4a5b-b7ab-b0b7c8e45101
[2022-03-22 18:11:09,525] [DEBUG] Command to send: c
o13
get
sspark.python.profile
sfalse
e

[2022-03-22 18:11:09,525] [DEBUG] Answer received: !ysfalse
[2022-03-22 18:11:09,525] [DEBUG] Command to send: r
u
SparkSession
rj
e

[2022-03-22 18:11:09,544] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,545] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
getDefaultSession
e

[2022-03-22 18:11:09,567] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,567] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
getDefaultSession
e

[2022-03-22 18:11:09,568] [DEBUG] Answer received: !yro19
[2022-03-22 18:11:09,568] [DEBUG] Command to send: c
o19
isDefined
e

[2022-03-22 18:11:09,569] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:09,569] [DEBUG] Command to send: r
u
SparkSession
rj
e

[2022-03-22 18:11:09,570] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,570] [DEBUG] Command to send: c
o11
sc
e

[2022-03-22 18:11:09,571] [DEBUG] Answer received: !yro20
[2022-03-22 18:11:09,571] [DEBUG] Command to send: i
org.apache.spark.sql.SparkSession
ro20
e

[2022-03-22 18:11:09,620] [DEBUG] Answer received: !yro21
[2022-03-22 18:11:09,620] [DEBUG] Command to send: c
o21
sqlContext
e

[2022-03-22 18:11:09,621] [DEBUG] Answer received: !yro22
[2022-03-22 18:11:09,621] [DEBUG] Command to send: r
u
SparkSession
rj
e

[2022-03-22 18:11:09,622] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,622] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
setDefaultSession
e

[2022-03-22 18:11:09,623] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,623] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
setDefaultSession
ro21
e

[2022-03-22 18:11:09,623] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,623] [DEBUG] Command to send: r
u
SparkSession
rj
e

[2022-03-22 18:11:09,624] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,624] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
setActiveSession
e

[2022-03-22 18:11:09,624] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,624] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
setActiveSession
ro21
e

[2022-03-22 18:11:09,625] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,625] [DEBUG] Command to send: c
o22
read
e

[2022-03-22 18:11:10,432] [DEBUG] Answer received: !yro23
[2022-03-22 18:11:10,432] [DEBUG] Command to send: r
u
PythonUtils
rj
e

[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonUtils
[2022-03-22 18:11:10,433] [DEBUG] Command to send: r
m
org.apache.spark.api.python.PythonUtils
toSeq
e

[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ym
[2022-03-22 18:11:10,433] [DEBUG] Command to send: i
java.util.ArrayList
e

[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ylo24
[2022-03-22 18:11:10,434] [DEBUG] Command to send: c
z:org.apache.spark.api.python.PythonUtils
toSeq
ro24
e

[2022-03-22 18:11:10,434] [DEBUG] Answer received: !yro25
[2022-03-22 18:11:10,434] [DEBUG] Command to send: m
d
o24
e

[2022-03-22 18:11:10,435] [DEBUG] Answer received: !yv
[2022-03-22 18:11:10,435] [DEBUG] Command to send: c
o23
load
ro25
e

22/03/22 18:11:10 WARN DataSource: All paths were ignored:
  

[Stage 0:>                                                          (0 + 1) / 1]

                                                                                
[2022-03-22 18:11:11,839] [DEBUG] Answer received: !xro26
[2022-03-22 18:11:11,839] [DEBUG] Command to send: c
o26
toString
e

[2022-03-22 18:11:11,840] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
[2022-03-22 18:11:11,840] [DEBUG] Command to send: c
o26
getCause
e

[2022-03-22 18:11:11,840] [DEBUG] Answer received: !yn
[2022-03-22 18:11:11,840] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:11,842] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,842] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:11,844] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,844] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:11,848] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,848] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e

[2022-03-22 18:11:11,849] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,849] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e

[2022-03-22 18:11:11,849] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:11,849] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
exceptionString
e

[2022-03-22 18:11:11,849] [DEBUG] Answer received: !ym
[2022-03-22 18:11:11,849] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
exceptionString
ro26
e

[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;\n	at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)\n	at scala.Option.getOrElse(Option.scala:189)\n	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)\n	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)\n	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)\n	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)\n	at scala.Option.getOrElse(Option.scala:189)\n	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)\n	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n	at java.lang.reflect.Method.invoke(Method.java:498)\n	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n	at py4j.Gateway.invoke(Gateway.java:282)\n	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n	at py4j.commands.CallCommand.execute(CallCommand.java:79)\n	at py4j.GatewayConnection.run(GatewayConnection.java:238)\n	at java.lang.Thread.run(Thread.java:748)\n
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data
[2022-03-22 18:11:11,852] [DEBUG] Command to send: m
d
o0
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o4
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o5
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o6
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o7
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o8
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o9
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o10
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o12
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o15
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o16
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o17
e

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o18
e

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o19
e

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o20
e

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
  File "osci-cli.py", line 93, in <module>
    cli(standalone_mode=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
    return self._execute(**self._process_params(kwargs))
  File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
    commits = osci_ranking_job.extract(to_date=to_day).cache()
  File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
    commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
  File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
    return self.spark_session.read.load(paths, **options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
    return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException[2022-03-22 18:11:11,879] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:11,881] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,881] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark.sql
rj
e

[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark.sql.internal
rj
e

[2022-03-22 18:11:11,883] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,883] [DEBUG] Command to send: r
u
org.apache.spark.sql.internal.SQLConf
rj
e

[2022-03-22 18:11:11,883] [DEBUG] Answer received: !ycorg.apache.spark.sql.internal.SQLConf
[2022-03-22 18:11:11,883] [DEBUG] Command to send: r
m
org.apache.spark.sql.internal.SQLConf
get
e

[2022-03-22 18:11:11,885] [DEBUG] Answer received: !ym
[2022-03-22 18:11:11,885] [DEBUG] Command to send: c
z:org.apache.spark.sql.internal.SQLConf
get
e

[2022-03-22 18:11:11,885] [DEBUG] Answer received: !yro27
[2022-03-22 18:11:11,885] [DEBUG] Command to send: c
o27
pysparkJVMStacktraceEnabled
e

[2022-03-22 18:11:11,886] [DEBUG] Answer received: !ybfalse
: Unable to infer schema for Parquet. It must be specified manually.;
[2022-03-22 18:11:11,924] [DEBUG] Command to send: m
d
o27
e

[2022-03-22 18:11:11,927] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,965] [DEBUG] Command to send: m
d
o26
e

[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,966] [DEBUG] Command to send: m
d
o25
e

[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,966] [DEBUG] Command to send: m
d
o23
e

[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv

cm-howard added the question (Further information is requested) label on Mar 24, 2022
@theycallmeswift
Author

@cm-howard any thoughts on this? Alternatively, I'd appreciate anything you could do to point me in the right direction.

@vlad-isayko
Collaborator

@theycallmeswift are there any files in the '/data' dir?

@theycallmeswift
Author

@vlad-isayko yep!

python3 osci-cli.py get-github-daily-push-events -d YYYY-MM-DD produces YYYY-MM-DD-[0-23].parquet files in /data/landing/github/events/push/YYYY/MM/DD/

and

python3 osci-cli.py process-github-daily-push-events -d YYYY-MM-DD produces COMPANY-YYYY-MM-DD.parquet files in /data/staging/github/raw-events/push/YYYY/MM/DD
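
In case it helps with debugging, a quick way to sanity-check one of those staging files with pandas (the path below is a hypothetical example following the COMPANY-YYYY-MM-DD pattern; pandas is already installed via requirements.txt):

import pandas as pd

# Hypothetical example path -- substitute one of the files produced above
df = pd.read_parquet("/data/staging/github/raw-events/push/2020/01/01/EPAM-2020-01-01.parquet")
print(df.dtypes)   # column names and dtypes
print(len(df))     # number of processed push events
print(df.head())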

@jerpelea

jerpelea commented Jun 9, 2022

@theycallmeswift
I have a similar error on Ubuntu 20.04
Did you manage to fix the error locally?

@theycallmeswift
Author

@jerpelea I did not, unfortunately. The docs need a serious overhaul from someone who knows the system better than I do!

@vlad-isayko
Collaborator

@theycallmeswift @jerpelea Hello, the problem really is outdated and incomplete documentation. We will fix this in the coming days. I'll keep you posted.

@jerpelea

@vlad-isayko can you share a quick update here before the documentation is updated?

@vlad-isayko
Collaborator

At the moment, this is the way to start (a convenience sketch for running all the steps in one go follows the list):

  1. python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
  2. python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
  3. python3 osci-cli.py daily-active-repositories -d 2020-01-01
  4. python3 osci-cli.py load-repositories -d 2020-01-01
  5. python3 osci-cli.py filter-unlicensed -d 2020-01-01
  6. python3 osci-cli.py daily-osci-rankings -td 2020-01-01
  7. python3 osci-cli.py get-change-report -d 2020-01-01
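
If it helps, here is a small convenience sketch (not part of the repo) that runs those steps in order via Python's subprocess module, assuming it is executed from the OSCI checkout:

import subprocess

DATE = "2020-01-01"
steps = [
    ["get-github-daily-push-events", "-d", DATE],
    ["process-github-daily-push-events", "-d", DATE],
    ["daily-active-repositories", "-d", DATE],
    ["load-repositories", "-d", DATE],
    ["filter-unlicensed", "-d", DATE],
    ["daily-osci-rankings", "-td", DATE],
    ["get-change-report", "-d", DATE],
]
for step in steps:
    # check=True stops at the first failing step so an error is not masked
    subprocess.run(["python3", "osci-cli.py", *step], check=True)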

You can write to me if you have any problems

@jerpelea

jerpelea commented Jun 10, 2022

@vlad-isayko

Thanks for your quick answer

Everything behaved normally until step 6:
python3 osci-cli.py daily-osci-rankings -td 2020-01-01

attached is the log
log.log

I am running Ubuntu 20.04 with Python 3.8

@vlad-isayko
Collaborator

@jerpelea can you also share which versions of pyspark and Spark you have?

@jerpelea

jerpelea commented Jun 10, 2022

@vlad-isayko

Packages from .local/lib/python3.8/site-packages, installed by pip install -r requirements.txt:

aiohttp-3.8.1.dist-info
aiosignal-1.2.0.dist-info
async_timeout-4.0.2.dist-info
attrs-21.4.0.dist-info
azure_common-1.1.25.dist-info
azure_core-1.7.0.dist-info
azure_functions-1.3.0.dist-info
azure_functions_durable-1.1.3.dist-info
azure_nspkg-3.0.2.dist-info
azure_storage_blob-12.3.2.dist-info
azure_storage_common-2.1.0.dist-info
azure_storage_nspkg-3.1.0.dist-info
cachetools-4.2.4.dist-info
charset_normalizer-2.0.12.dist-info
click-7.1.2.dist-info
deepmerge-0.1.1.dist-info
frozenlist-1.3.0.dist-info
furl-2.1.3.dist-info
google_api_core-1.31.5.dist-info
googleapis_common_protos-1.56.1.dist-info
google_auth-1.35.0.dist-info
google_cloud_bigquery-1.25.0.dist-info
google_cloud_core-1.7.2.dist-info
google_resumable_media-0.5.1.dist-info
iniconfig-1.1.1.dist-info
isodate-0.6.1.dist-info
Jinja2-2.11.3.dist-info
MarkupSafe-2.0.1.dist-info
more_itertools-8.13.0.dist-info
msrest-0.6.21.dist-info
multidict-6.0.2.dist-info
numpy-1.19.5.dist-info
orderedmultidict-1.0.1.dist-info
packaging-21.3.dist-info
pandas-1.0.3.dist-info
pbr-5.9.0.dist-info
pip-22.1.2.dist-info
pluggy-0.13.1.dist-info
protobuf-4.21.1.dist-info
py-1.11.0.dist-info
py4j-0.10.9.dist-info
pyarrow-0.17.1.dist-info
pyasn1-0.4.8.dist-info
pyasn1_modules-0.2.8.dist-info
pypandoc-1.5.dist-info
pyparsing-3.0.9.dist-info
pyspark-3.0.1.dist-info
pytest-6.0.1.dist-info
python_dateutil-2.8.1.dist-info
PyYAML-5.4.dist-info
requests_oauthlib-1.3.1.dist-info
rsa-4.8.dist-info
six-1.13.0.dist-info
testresources-2.0.1.dist-info
toml-0.10.2.dist-info
XlsxWriter-1.2.3.dist-info

@vlad-isayko
Collaborator

@jerpelea maybe there is a problem with the parquet files. We need to check them.

@jerpelea

jerpelea commented Jun 13, 2022

@vlad-isayko what versions are you using? Do you have any suggestions for how to check it?

@vlad-isayko
Collaborator

@jerpelea we use the same libraries with the same versions. Can you share some of the files generated in the staging area?

@jerpelea

@vlad-isayko thanks for your quick answer
Here is the file
repository-2021-01-01.zip

@vlad-isayko
Collaborator

@jerpelea

Are there any files in /staging/github/events/push/2021/01/01/?

Before step 6 there should be files in these directories (a quick way to check them is sketched after the list):

  • /staging/github/raw-events/push/2021/01/01/
  • /staging/github/repository/2021/01/
  • /staging/github/events/push/2021/01/01/
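
A minimal way to check them, assuming the default local base_path of /data from local.yml:

from pathlib import Path

base = Path("/data/staging/github")
for sub in ("raw-events/push/2021/01/01", "repository/2021/01", "events/push/2021/01/01"):
    files = list((base / sub).glob("*"))
    print(f"{sub}: {len(files)} file(s)")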

@jerpelea

@vlad-isayko I have
/landing/github/events/push/2021/01/01/
/staging/github/raw-events/push/2021/01/01/
/staging/github/repository/2021/01/

there is no /staging/github/events/push/2021/01/01/

Thanks

@vlad-isayko
Collaborator

@jerpelea

Can you rerun step 5 python3 osci-cli.py filter-unlicensed -d 2020-01-01 and share the logs from this command?

I think there is some problem at this step.

@jerpelea

@vlad-isayko attached are the log file and some result files

filter-unlicensed.zip
github.zip

thanks

@vlad-isayko
Collaborator

@jerpelea

OK, it's strange that the repository file in staging is empty...
Is this file present: /landing/github/repository/2021/01/2021-01-01.csv?
Can you share it?

@jerpelea

@vlad-isayko
Collaborator

@jerpelea

So the error occurred at step 4 when getting information about the repositories from the Github API.

I ran this step myself with your source file and will check the output.

Could you check your config for a valid GitHub API token?

github:
  token: '394***************************************77'
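
A rough way to confirm a token is accepted is to call the GitHub rate-limit endpoint directly (a sketch using the requests library, which the project's dependencies already pull in; the token value below is a placeholder):

import requests

token = "ghp_xxxxxxxxxxxxxxxxxxxx"  # placeholder -- use the value from local.yml
resp = requests.get("https://api.github.com/rate_limit",
                    headers={"Authorization": f"token {token}"})
if resp.status_code == 200:
    # an authenticated token reports a core limit of 5000 instead of 60
    print(resp.json()["resources"]["core"])
else:
    # 401 with "Bad credentials" means the token was rejected
    print(resp.status_code, resp.json().get("message"))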

@jerpelea

jerpelea commented Jun 13, 2022

@vlad-isayko
thanks for pointing it out
I think the token setup is a missing step in the README.
I added the token to local.yml and restarted step 4.

this is how the logs look now
[2022-06-13 09:42:38,265] [INFO] Get repository MinCiencia/Datos-COVID19 information
[2022-06-13 09:42:38,265] [DEBUG] Make request to Github API method=GET, url=https://api.github.com/repos/MinCiencia/Datos-COVID19, kwargs={}
[2022-06-13 09:42:38,485] [DEBUG] https://api.github.com:443 "GET /repos/MinCiencia/Datos-COVID19 HTTP/1.1" 200 None
[2022-06-13 09:42:38,486] [DEBUG] Get response[200] from Github API method=GET, url=https://api.github.com/repos/MinCiencia/Datos-COVID19, kwargs={'headers': {'Authorization': 'token gxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxdo'}}

I will keep you updated on the progress
Thanks for the support

@jerpelea

@vlad-isayko new errors at step 6
daily-osci-rankings.zip

@vlad-isayko
Collaborator

@jerpelea

Can you share these files:

  • /data/staging/github/events/push/2020/01/01/unity_technologies-2020-01-01.parquet
  • /data/staging/github/events/push/2020/01/01/secops_solutions-2020-01-01.parquet
  • /data/staging/github/events/push/2020/01/01/luxoft-2020-01-01.parquet
  • /data/staging/github/events/push/2020/01/01/lyft-2020-01-01.parquet
  • /data/staging/github/events/push/2020/01/01/cloudbees-2020-01-01.parquet

@jerpelea

jerpelea commented Jun 13, 2022

@vlad-isayko

sure!
here are the parquet files

files.zip

@vlad-isayko
Collaborator

@jerpelea

OK, there is a bug in how the pandas dataframe is saved to parquet: a column in which every value is None gets stored as Int32.

This case is quite rare, which is apparently why we did not catch this bug earlier.

We plan to fix this bug.

For now, you can resave these files with the correct column types.
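
To illustrate (a minimal sketch, not the actual OSCI code; the 'repo' column is made up and 'language' is borrowed from the script further down): write a frame whose 'language' column is all None and look at the stored schema. The exact stored type can vary with the parquet engine, but it will not be a string column, which can trip up Spark when it merges schemas across the per-company files.

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"repo": ["octo/example"], "language": [None]})
df.to_parquet("/tmp/all_none.parquet", index=False)
print(pq.ParquetFile("/tmp/all_none.parquet").schema)   # 'language' is not stored as a string

# Casting to str before saving keeps the column type consistent across files
df.astype({"language": str}).to_parquet("/tmp/all_none.parquet", index=False)
print(pq.ParquetFile("/tmp/all_none.parquet").schema)   # now 'language' is a string column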

vlad-isayko added the bug (Something isn't working) label on Jun 13, 2022
vlad-isayko self-assigned this on Jun 13, 2022
@jerpelea

@vlad-isayko how do I resave them?

@vlad-isayko
Collaborator

@jerpelea

You can run this simple script, or share the files from /data/staging/github/events/push/ so I can do it for you:

import pandas as pd
from pathlib import Path

# Rewrite every staging push-events parquet with 'language' and 'org_name' cast to str
for path in Path('/data/staging/github/events/push/').rglob('*.parquet'):
    pd.read_parquet(path).astype({'language': str, 'org_name': str}).to_parquet(path, index=False)

@jerpelea

jerpelea commented Jun 13, 2022

@vlad-isayko thanks for the fix

It fixed the issue and step 6 completed
