Move code from src-d/ml #1

Guillemdb · 2019-04-16T11:39:25Z

This is the first stage of the refactor of src-d/ml into ml-core.

The code is moved to ml-core and the imports are fixed.
The tests pass.
Most of the CI pipeline works correctly.
Added black for styling (@vmarkovtsev approved)

Signed-off-by: guillemdb <g.duran@instadeep.com>

Signed-off-by: guillemdb <guillem@sourced.tech>

zurk · 2019-04-16T11:45:06Z

Well, as soon as it is almost the same files we have in src-d/ml can you be more specific how do you change it if you did. What I should pay more attention to?

Guillemdb · 2019-04-16T13:30:45Z

The files are the same, the only stuff I did was:

Move al UAST related files into the uast folder.
Change the imports to match the new project name.
Remove references to pyspark in the README files.
Change the references in the CI/setup files from srcd.ml to ml_core.
Apply black as a code formatter.

The functionality has not been changes, I just changed the file locations.

zurk

Thanks, @Guillemdb for great work!

I suggest keeping in mind one general point while we splitting sourced-ml package.

Everything that we commit to ml-core should work, be related and consistent.
I do not mean code, but the rest: links, descriptions, etc. Otherwise, it will be harder to maintain. And we will deal with remains of tails from the deprecated package. I do not suggest to adopt it in the PR but rather just exclude it for now until we need this part (see my suggestion about readme for example). I hope I spotted most of the places with old and unrelated things. Please check if there is something else.

There is a list of things I was not able to comment because they are in the initial commit and not in the PR:

Name of the project in contributing.md should be changed.
Maybe we should update a year from license to 2019. I think Marcelo knows better, please ask him.
Maintainers the last bracket is not escaped should be \) instead of ')'
No doc directory means no SUMMARY.md. let's delete this file.

And two more general things.

package name. I think core_ml in not the best choice because usually python packages do not have an underscore in its name. I think what we should do in to have sourced.ml.core package and share a namespace with sourced.ml.mining. I suggest using sourced prefix to avoid possible overlapping with others ml packages. (@vmarkovtsev please approve).
I do not see .travis.yml file. But tests are included. Do you plan to add it later?

Dockerfile

Makefile

README.md

ml_core/__init__.py

ml_core/extractors/identifiers.py

setup.py

Guillemdb · 2019-04-17T11:49:57Z

I pushed a new commit with the requested changes and the .travis.yml file. I corrected the flake8 issues in another branch, if you allow me to ignore them in the next PR they will be fixed.

In the .travis.yml file I have commented the deployment stage untill I get the correct environment variables.

Btw, @marnovo told me that there is no consensus on the year of the license, so I didn't make any change to it.

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Guillemdb · 2019-04-17T13:52:19Z

The build is failing because I remove the references to bblfsh on the MAKEFILE. Should I add it again so the build passes?

Signed-off-by: Guillem Duran <guillem@sourced.tech>

zurk · 2019-04-17T13:55:50Z

oh, sure. In this case, we should keep it.

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Guillemdb · 2019-04-17T15:05:33Z

When I run the tests on my local repository I find that only one of them is failing:

======================================================================
FAIL: test_result (ml_core.tests.test_uast_to_id_distance.Uast2IdLineDistanceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/guillem/srcd/ml-core/ml_core/tests/test_uast_to_id_distance.py", line 141, in test_result
    self.assertEqual(res, correct)
AssertionError: Lists differ: [(('__spec__', 'ModuleSpec'), 0), (('__spec__',[1679 chars], 1)] != [(('__package__', 'ModuleSpec'), 2), (('__spec_[2142 chars], 1)]

First differing element 0:
(('__spec__', 'ModuleSpec'), 0)
(('__package__', 'ModuleSpec'), 2)

Second list contains 14 additional elements.
First extra element 55:
(('utmain', 'collections'), 2)

+ [(('__package__', 'ModuleSpec'), 2),
- [(('__spec__', 'ModuleSpec'), 0),
? ^

+  (('__spec__', 'ModuleSpec'), 0),
? ^

   (('__spec__', 'ModuleSpec'), 1),
   (('__spec__', 'ModuleSpec'), 1),
+  (('__spec__', 'ModuleSpec'), 2),
   (('__spec__', '__package__'), 0),
+  (('collections', 'ModuleSpec'), 1),
   (('collections', 'ModuleSpec'), 2),
   (('collections', '__package__'), 1),
   (('collections', '__spec__'), 1),
+  (('collections', '__spec__'), 2),
   (('modules', '__package__'), 1),
   (('modules', '__spec__'), 1),
   (('modules', 'collections'), 2),
   (('namedtuple', 'ModuleSpec'), 0),
   (('namedtuple', 'ModuleSpec'), 1),
+  (('namedtuple', 'ModuleSpec'), 1),
   (('namedtuple', 'ModuleSpec'), 2),
   (('namedtuple', 'ModuleSpec'), 2),
   (('namedtuple', '__package__'), 1),
+  (('namedtuple', '__package__'), 2),
   (('namedtuple', '__spec__'), 1),
   (('namedtuple', '__spec__'), 1),
+  (('namedtuple', '__spec__'), 2),
+  (('namedtuple', '__spec__'), 2),
   (('namedtuple', 'collections'), 0),
-  (('namedtuple', 'collections'), 2),
?                                  ^

+  (('namedtuple', 'collections'), 1),
?                                  ^

   (('namedtuple', 'modules'), 2),
   (('setup_logging', 'modelforge.logs'), 0),
   (('setup_logging', 'setup'), 1),
   (('sys', '__package__'), 1),
   (('sys', '__spec__'), 1),
   (('sys', 'collections'), 2),
   (('sys', 'modelforge.logs'), 2),
   (('sys', 'modules'), 0),
   (('sys', 'namedtuple'), 2),
   (('sys', 'setup_logging'), 2),
   (('utmain', 'ModuleSpec'), 0),
   (('utmain', 'ModuleSpec'), 1),
   (('utmain', 'ModuleSpec'), 1),
   (('utmain', 'ModuleSpec'), 1),
   (('utmain', 'ModuleSpec'), 2),
+  (('utmain', 'ModuleSpec'), 2),
+  (('utmain', 'ModuleSpec'), 2),
   (('utmain', '__package__'), 0),
   (('utmain', '__package__'), 0),
   (('utmain', '__package__'), 1),
   (('utmain', '__spec__'), 0),
   (('utmain', '__spec__'), 0),
   (('utmain', '__spec__'), 0),
   (('utmain', '__spec__'), 1),
   (('utmain', '__spec__'), 2),
   (('utmain', 'collections'), 1),
   (('utmain', 'collections'), 1),
   (('utmain', 'collections'), 2),
+  (('utmain', 'collections'), 2),
   (('utmain', 'modules'), 0),
   (('utmain', 'modules'), 1),
   (('utmain', 'modules'), 1),
   (('utmain', 'namedtuple'), 1),
   (('utmain', 'namedtuple'), 1),
   (('utmain', 'namedtuple'), 1),
   (('utmain', 'namedtuple'), 2),
+  (('utmain', 'namedtuple'), 2),
+  (('utmain', 'namedtuple'), 2),
+  (('utmain', 'namedtuple'), 2),
   (('utmain', 'sys'), 0),
   (('utmain', 'sys'), 1),
   (('utmain', 'sys'), 1)]

Does the file test/source/example.py depend on the project from where it is called?
Does this test somehow fail for something I did wrong?

However, during the CI, a lot of tests fail with a similar error to this one>

ERROR: test_greatest2 (ml_core.tests.test_df.DocumentFrequenciesTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ml_core/ml_core/tests/test_df.py", line 10, in setUp
    self.model = DocumentFrequencies().load(source=paths.DOCFREQ)
  File "/usr/local/lib/python3.6/dist-packages/modelforge/model.py", line 93, in load
    cache_dir = os.path.join(self.cache_dir(), self.NAME)
  File "/usr/local/lib/python3.6/dist-packages/modelforge/model.py", line 338, in cache_dir
    raise RuntimeError("modelforge is not configured; look at modelforge.configuration. "
RuntimeError: modelforge is not configured; look at modelforge.configuration. Depending on your objective you may or may not want to create a modelforgecfg.py file which sets VENDOR and the rest.

The modelforgecfg.py is created, but somehow the path to the files is "/ml_core/ml_core/tests/test_df.py", and the folder ml_core is duplicated. Do you have any idea of why this happens?

I am creating an additional copy of ml_core and moving the whole project inside that folder, but I'm not sure where. Could it be in the L29 of the Dockerfile?

zurk · 2019-04-17T20:32:49Z

I will check tomorrow.

Signed-off-by: Konstantin Slavnov <konstantin@sourced.tech>

zurk · 2019-04-18T15:15:33Z

I add my changes to the commit. Now everything works fine.
There are several things that @Guillemdb before I can approve it I ask you to do several more things:

Namespace thing is approved by Vadim (I asked him in DM).
1. Put everything you have in core_ml to sourced/ml/core. Address all core_ml entries in code accordingly.
2. Since we do not have the second package yet (sourced.ml.mining) keep __init__.py empty from pkg_resources.declare_namespace(__name__) we will add it later. If we add it now, it is not possible to test it.
Search for all sourced.ml entries in the PR and address them accordingly.
Add flake8-commas==2.0.0 to requrements-lint.txt and address new style issues.

zurk · 2019-04-18T15:20:29Z

things that should be addressed in the next PRs (@Guillemdb you can convert this list to issues after we merge this PR):

Add a directory for documentation and documentation-related checks to Travis as we have in https://github.com/src-d/style-analyzer.
Add the deployment stage to Travis
If we use black styler we should have a check in Travis that it was applied.
Add dependabot as we have in https://github.com/src-d/ml (see its PRs: Bump flake8-quotes from 1.0.0 to 2.0.0 ml#406)

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Guillemdb · 2019-04-22T07:06:08Z

I finally managed to fix the CI and upload the following changes:

The namespace of the library from ml_core to sourced.ml.core.
Updated CI and configuration files accordingly to comply with the new name.
I didn't implement flake8-commas==2.0.0 because black already takes care of it. (I would like to add black as a pre-commit hook in future PRs.)
Renamed the folder sourced.ml.core.tests.models to sourced.ml.core.tests.models_tensorflow to avoit conflicts with the file sourced.ml.core.tests.models.py.
Skipped the test test_serialize from tests.models_tensorflow.test_tensorflow due to an unknown error with stack-trace:

======================================================================
ERROR: test_serialize (ml.core.tests.models_tensorflow.test_tensorflow.TensorFlowModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/sourced/ml/core/tests/models_tensorflow/test_tensorflow.py", line 18, in test_serialize
    TensorFlowModel().construct(graphdef=gd).save(buffer)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/modelforge/model.py", line 390, in save
    raise ValueError("series must be specified")
ValueError: series must be specified

zurk · 2019-04-22T07:57:18Z

I didn't implement flake8-commas==2.0.0 because black already takes care of it. (I would like to add black as a pre-commit hook in future PRs.)

Ok. One thing, you should add not a pre-commit hook but another style check (see #1 (comment)). I am fine if you do it in the next PR. In the best case, black check should completely replace flake8.

Signed-off-by: Konstantin Slavnov <konstantin@sourced.tech>

zurk

only one minor thing.

README.md

smola · 2019-04-22T08:31:15Z

@Guillemdb @vmarkovtsev It would be desirable to preserve history or, at least, separate commits for code import and code changes, so it's clear the point at which it's imported and what changes were made. Please, ping me on Slack if there's any doubt about how to do this.

Guillemdb · 2019-04-22T08:34:24Z

So far there is no changes to the code functionality besides fixing the Tensorflow test, I was waiting for this PR to be merged to make any change to the core-ml functionality if needed.

zurk

Let's wait for @vmarkovtsev input also.

Signed-off-by: Guillem Duran <guillem@sourced.tech>

smola · 2019-04-22T09:05:55Z

@Guillemdb But there were non-functional changes, right? Or just things like imports?

Guillemdb · 2019-04-22T09:11:52Z

Yeah, all the changed were non-functional, what I did was:

Resetting the .md files to a minified version that only mentions the core-mlproject.
Copy-pasting the files to the new repo.
Fixing the imports to match the new namespace specifications.
Apply black formatting to minimize future diffs.
Change the CI so it's defined as in the lookout-sdk-ml project (as @vmarkovtsev suggested).

vmarkovtsev · 2019-04-23T15:43:25Z

@Guillemdb It is impossible for me to review 118 files. I will comment and the thing will turn into an unmanageable mess. Can you please split all the changes into feature-full 5-6 PRs and submit them one by one. It is fine that each following PR depends on the previous ones.

Move the text files
Add CI. The repo is empty so it passes. CI should additionally test packaging btw.
Add the old code. Pass CI.
Add black formatting. Yes, that is important, (3) must be exactly the same old code apart from the CI fixes.
Add the rest of the files, e.g. autodoc generation

guillemdb and others added 3 commits April 11, 2019 16:20

Add requirements, linters, flake8, coveragerc and gitignore

d8b8989

Signed-off-by: guillemdb <g.duran@instadeep.com>

Add first refactor

6f52e2a

Signed-off-by: guillemdb <guillem@sourced.tech>

Add black style

382ff19

Signed-off-by: guillemdb <guillem@sourced.tech>

Guillemdb assigned zurk Apr 16, 2019

Guillemdb requested a review from zurk April 16, 2019 11:40

zurk removed their assignment Apr 16, 2019

zurk suggested changes Apr 17, 2019

View reviewed changes

Guillem Duran added 5 commits April 17, 2019 15:34

Fix non source code files to be consistent with project refactor

d8144fd

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Fix project path

995a2a9

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Add E203 to ignore because it is handled by black

5d348fd

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Rename files inside the uast folder to remove 'uast_' from the file name

d9c163c

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Make fixes to pass flake8

71bd199

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Guillemdb requested a review from zurk April 17, 2019 13:50

Fix typo in ml_core path

9100cd7

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Guillem Duran added 4 commits April 17, 2019 15:57

Add bblfsh-start

145df36

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Remove todo annotation

9454d17

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Fix imports

75cb193

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Format using black

5bd31aa

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Fix setup, Dockerfile, Makefile, tests and CI

11de713

Signed-off-by: Konstantin Slavnov <konstantin@sourced.tech>

Refactor files to meet new namespace specifications.

9b026ba

Signed-off-by: Guillem Duran <guillem@sourced.tech>

Fix tensorflow test

9ecfbcb

Signed-off-by: Konstantin Slavnov <konstantin@sourced.tech>

zurk suggested changes Apr 22, 2019

View reviewed changes

README.md Outdated Show resolved Hide resolved

zurk approved these changes Apr 22, 2019

View reviewed changes

Fix typo

4571b25

Signed-off-by: Guillem Duran <guillem@sourced.tech>

vmarkovtsev changed the title ~~Refactor~~ Move code from src-d/ml Apr 23, 2019

vmarkovtsev closed this Apr 23, 2019

Guillemdb mentioned this pull request Apr 24, 2019

Refactor #4

Merged

Move code from src-d/ml #1

Move code from src-d/ml #1

Uh oh!

Conversation

Guillemdb commented Apr 16, 2019 • edited by smola Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zurk commented Apr 16, 2019

Uh oh!

Guillemdb commented Apr 16, 2019

Uh oh!

zurk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Guillemdb commented Apr 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Guillemdb commented Apr 17, 2019

Uh oh!

zurk commented Apr 17, 2019

Uh oh!

Guillemdb commented Apr 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zurk commented Apr 17, 2019

Uh oh!

zurk commented Apr 18, 2019

Uh oh!

zurk commented Apr 18, 2019

Uh oh!

Guillemdb commented Apr 22, 2019

Uh oh!

zurk commented Apr 22, 2019

Uh oh!

zurk left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

smola commented Apr 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Guillemdb commented Apr 22, 2019

Uh oh!

zurk left a comment

Choose a reason for hiding this comment

Uh oh!

smola commented Apr 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Guillemdb commented Apr 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vmarkovtsev commented Apr 23, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Guillemdb commented Apr 16, 2019 •

edited by smola

Loading

Guillemdb commented Apr 17, 2019 •

edited

Loading

Guillemdb commented Apr 17, 2019 •

edited

Loading

zurk left a comment •

edited

Loading

smola commented Apr 22, 2019 •

edited

Loading

smola commented Apr 22, 2019 •

edited

Loading

Guillemdb commented Apr 22, 2019 •

edited

Loading