
Questions about the precision of the trained model #23

Open
ttbuffey opened this issue Apr 29, 2019 · 10 comments

Comments

@ttbuffey

ttbuffey commented Apr 29, 2019

Dear Author,

When we run the model trained on the default dataset in deep-code-search-master/pytorch/data/github, we find that the relevance of the search results to the input query is not very high; the highest similarity is about 0.3.

  1. I want to confirm whether the results from the default dataset are indeed not that relevant. What similarity scores did you get?
    Question: "convert string to date"
    Result: ('public static String formatSeconds ( Object obj ) { long time = - 1L ; if ( obj instanceof Long ) { time = ( ( Long ) obj ) . longValue ( ) ; } else if ( obj instanceof Integer ) { time = ( ( Integer ) obj ) . intValue ( ) ; } return ( time + "-s" ) ; } \r\n', 0.31213856)

  2. How about using the larger dataset you provided on Google Drive (https://drive.google.com/drive/folders/1GZYLT_lzhlVczXjD6dgwVUvDDPHMB6L7?usp=sharing)?
    Will the precision on the large dataset be much higher? We haven't got the result yet because it takes quite a long time to train.

We sincerely hope you can help answer. Thanks a lot.
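[Editor's note: for readers following this thread, the similarity scores quoted above are cosine similarities between the embedded query and embedded code snippets, which is how DeepCS-style retrieval ranks results. Below is a minimal, hedged sketch of that scoring step; the random vectors are made-up stand-ins, not the model's real embeddings, and `cosine_sim` is an illustrative helper, not a function from this repository.]

```python
# Sketch of DeepCS-style ranking: score code snippets by the cosine
# similarity between a query embedding and precomputed code embeddings.
# The embeddings here are random placeholders for illustration only.
import numpy as np

def cosine_sim(query_vec, code_vecs):
    """Cosine similarity between one query vector and a matrix of code vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    return c @ q  # one score per code snippet, each in [-1, 1]

rng = np.random.default_rng(0)
query = rng.normal(size=128)          # stand-in for the embedded query
codebase = rng.normal(size=(5, 128))  # stand-ins for 5 embedded snippets

scores = cosine_sim(query, codebase)
top = np.argsort(-scores)             # best-matching snippet indices first
print(scores[top[0]])                 # the "highest similarity" discussed above
```

Because the scores are cosine values bounded by 1.0, a top score of 0.3–0.4 (as reported in this thread) reflects how close the nearest snippet's embedding is to the query's, not a percentage of correctness.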

@guxd
Owner

guxd commented Apr 29, 2019

Yes, you should use the larger dataset. The small one is just for quick setup.

@ttbuffey
Author

Can you provide an already-trained model based on the large dataset for me to test?
Our GPU server is quite slow: after seven days of training we have only reached epoch 11 of 2000, so it will still take quite a long time.
Thanks a lot for your quick response and help.

@guxd
Owner

guxd commented Apr 29, 2019

@ttbuffey I uploaded the trained model (epoch 500, by Keras) to the data folder on Google Drive. Please check it out and let me know if you have any questions.

@ttbuffey
Author

@guxd Thanks a lot, I'm trying it.

@ttbuffey
Author

@guxd We tested it: the similarity is around 0.4, and the results seem more relevant now.
Is this similarity consistent with what you would expect?

I also want to confirm: was the epoch500 model you just provided trained on the 18,233,872 records mentioned in the paper?

@guxd
Owner

guxd commented Apr 29, 2019

0.4 seems normal. The epoch500 model was trained with the dataset from Google Drive. The data contains 18,233,872 records, as mentioned in the paper.

@guxd guxd mentioned this issue Apr 29, 2019
@ahzz1207

ahzz1207 commented May 7, 2019

Hello, I used the original dataset from your cloud disk and ran 1000 epochs with the Keras model's original parameters, with a chunk size of 200,000. The best val_loss of the optimal model during training was about 0.00016. When I used the optimal model to run the eval function, the top-10 result was about 0.79 and the MRR was about 0.53, which differs slightly from your paper. When I tested search on the use_dataset, the highest similarity score was around 0.36. What could the problem be? Thank you for your reply!

@guxd
Owner

guxd commented May 21, 2019

@ahzz1207 The MRR shown by the program is calculated in a different way from the one in the paper: it is automatically computed on the training set, whereas the MRRs in the paper were calculated manually from human labeling. 0.36 seems a bit below expectation; it is usually around 0.4.
#16
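[Editor's note: to clarify the distinction above, the "automatic" MRR printed during training typically means: for each query, find the rank of its ground-truth snippet among the scored candidates and average the reciprocal ranks. A minimal sketch follows; the `mean_reciprocal_rank` helper and the rank values are illustrative, not taken from this repository's output.]

```python
# Sketch of an automatically computed Mean Reciprocal Rank (MRR):
# each query contributes 1/rank, where rank is the 1-based position
# of its ground-truth snippet in the retrieved list.

def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct snippet for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 2, 5, 1, 10]  # hypothetical ranks of the true answers
print(round(mean_reciprocal_rank(ranks), 2))  # 0.56
```

The paper's MRR instead relies on human judges labeling which retrieved snippets are relevant, so the two numbers are not directly comparable.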

@primary-studyer

I retrained a model for 670 epochs; the minimum loss I selected was 0.000329920450937, so I stopped training there.
But I feel the results are irrelevant.

Input Query: convert string to date
How many results? 10

('public static Class < ? > loadSystemClass ( String className ) throws ClassNotFoundException { return Class . forName ( className ) ; } \n', 0.40123203)

('public static BinaryExpression andAssign ( Expression left , Expression right , Method method , LambdaExpression lambdaExpression ) { throw Extensions . todo ( ) ; } \n', 0.40073127)

('public static < TSource , TKey , TElement , TResult > Enumerable < TResult > groupBy ( Enumerable < TSource > enumerable , Function1 < TSource , TKey > keySelector , Function1 < TSource , TElement > elementSelector , Function2 < TKey , Enumerable < TElement > , TResult > resultSelector ) { throw Extensions . todo ( ) ; } \n', 0.40073127)

('public static void squelchWriter ( Writer writer ) { try { if ( writer != null ) { writer . close ( ) ; } } catch ( IOException ex ) { } } \n', 0.40073127)

('public String getElement ( int index ) throws Exception { if ( index != 0 ) { throw new Exception ( "INTERNAL-ERROR:-invalid-index-" + index + "-sent-to-AreaMoments:getElement" ) ; } else { return Integer . toString ( number_windows ) ; } } \n', 0.39803487)

('public static byte [ ] loadBinary ( File binFile ) throws IOException { byte [ ] xferBuffer = new byte [ 10240 ] ; byte [ ] outBytes = null ; ByteArrayOutputStream baos ; int i ; FileInputStream fis = new FileInputStream ( binFile ) ; try { baos = new ByteArrayOutputStream ( ) ; while ( ( i = fis . read ( xferBuffer ) ) > 0 ) baos . write ( xferBuffer , 0 , i ) ; outBytes = baos . toByteArray ( ) ; } finally { try { fis . close ( ) ; } catch ( IOException ioe ) { } finally { fis = null ; baos = null ; } } return outBytes ; } \n', 0.39748406)

('@ XmlElementDecl ( namespace = "http://schemas.microsoft.com/2003/10/Serialization1/" , name = "duration" ) public JAXBElement < Duration > createDuration ( Duration value ) { return new JAXBElement < Duration > ( _Duration_QNAME , Duration . class , null , value ) ; } \n', 0.39748406)

('protected void doFormatValue ( final CharArrayBuffer buffer , final String value , boolean quote ) { if ( ! quote ) { for ( int i = 0 ; ( i < value . length ( ) ) && ! quote ; i ++ ) { quote = isSeparator ( value . charAt ( i ) ) ; } } if ( quote ) { buffer . append ( '"' ) ; } for ( int i = 0 ; i < value . length ( ) ; i ++ ) { char ch = value . charAt ( i ) ; if ( isUnsafe ( ch ) ) { buffer . append ( '|' ) ; } buffer . append ( ch ) ; } if ( quote ) { buffer . append ( '"' ) ; } } \n', 0.39748406)

('public double diagonal ( ) { return Math . sqrt ( Math . pow ( theLength , 2 ) + Math . pow ( theWidth , 2 ) ) ; } \n', 0.39587107)

('protected String buildQuery ( ) throws UnsupportedEncodingException { String timestamp = getTimestampFromLocalTime ( Calendar . getInstance ( ) . getTime ( ) ) ; Map < String , String > queryParams = new TreeMap < String , String > ( ) ; queryParams . put ( "ApplicationName" , application_name ) ; queryParams . put ( "AWSAccessKeyId" , accessKeyId ) ; queryParams . put ( "Description" , "descriptionversion1" ) ; queryParams . put ( "Operation" , ACTION_NAME ) ; queryParams . put ( "SignatureVersion" , "2" ) ; queryParams . put ( "SignatureMethod" , HASH_ALGORITHM ) ; queryParams . put ( "Timestamp" , timestamp ) ; queryParams . put ( "Version" , SERVICE_VERSION ) ; String query = "" ; boolean first = true ; for ( String name : queryParams . keySet ( ) ) { if ( first ) first = false ; else query += "&" ; query += name + "=" + URLEncoder . encode ( queryParams . get ( name ) , "UTF-8" ) ; } return query ; } \n', 0.39587107)

@xdliu1998

When I try to reproduce the results, I also find that the returned snippets are not relevant. Have you solved the problem?
